Voxtral. arXiv preprint arXiv:2507.13264, 2025
17 Pith papers cite this work.
citing papers explorer
- CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings
The CBT-Audio dataset shows that adding audio input improves distress intensity estimation over transcripts alone for 8 of 10 audio language models, with the clearest gains when verbal content and vocal delivery diverge.
- Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.
- Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
- MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus
MCGA is a new 119-hour multi-task audio corpus for classical Chinese literary genres that shows current MLLMs face substantial challenges on its test set.
- Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
- All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation
Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.
- Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
- A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning
Coordinated multi-modal typographic attacks on MLLMs achieve an 83.43% success rate versus 34.93% for single-modality attacks.
- Voxtral Realtime
Voxtral Realtime is an end-to-end trained streaming ASR model that achieves Whisper-level transcription quality at a 480 ms delay after scaling pretraining across 13 languages.
- MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages
MCAT scales MLLMs to many-to-many speech translation across 70 languages via curriculum learning and a 30-token speech adapter, surpassing prior SOTA on FLEURS while improving speed.
- AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs
AU-Harness introduces an efficient unified evaluation framework for audio LLMs featuring batch optimizations, multi-turn dialogue support, and standardized protocols for fair comparisons.
- Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps
Four attention metrics enable logistic regression classifiers that detect hallucinations in SpeechLLMs with up to +0.23 PR-AUC gains over baselines on ASR and translation tasks.
- Voxtral TTS
Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving a 68.4% win rate over ElevenLabs in human evaluations.
- When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models
Irrelevant audio, including silence, reduces accuracy and increases volatility in text reasoning for large audio-language models, with effects worsening at longer durations, higher amplitudes, and higher temperatures.
- Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages
Introduces the XLSR-Thai encoder, U-Align alignment, and the Thai-SUP data pipeline to enable multitask speech understanding SLLMs for Thai.
- NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR
NIM4-ASR delivers SOTA ASR performance on public benchmarks using a 2.3B-parameter LLM with multi-stage training, real-time streaming, and million-scale hotword customization via RAG.
- Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference
A quantized int4 version of Nemotron ASR runs faster than real time on CPU at 8.20% WER and a 0.67 GB model size, setting a new efficiency point for on-device streaming speech recognition.