arXiv preprint arXiv:2507.13264 , year=

Voxtral , author= · 2025 · arXiv 2507.13264

22 Pith papers cite this work. Polarity classification is still indexing.

22 Pith papers citing it

read on arXiv browse 22 citing papers

citation-role summary

background 2 method 1

citation-polarity summary

background 2 use method 1

representative citing papers

RealityTest: How People Probe AI Identity and Whether Models Disclose It

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

RealityTest is a human-grounded multilingual multimodal benchmark showing that only 31% of people ask AI identity directly and that suppression instructions plus question phrasing dominate disclosure behavior over model choice.

CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

cs.AI · 2026-05-17 · unverdicted · novelty 7.0

CBT-Audio dataset shows that adding audio input improves distress intensity estimation over transcripts alone for 8 of 10 audio language models, with clearest gains when verbal content and vocal delivery diverge.

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

eess.AS · 2026-04-28 · unverdicted · novelty 7.0

Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

cs.CR · 2026-04-16 · unverdicted · novelty 7.0

AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.

MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus

cs.CL · 2026-01-14 · unverdicted · novelty 7.0

MCGA is a new 119-hour multi-task audio corpus for classical Chinese literary genres that shows current MLLMs face substantial challenges on its test set.

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

cs.CL · 2025-12-18 · unverdicted · novelty 7.0

Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

cs.CL · 2026-06-11 · unverdicted · novelty 6.0

Introduces ontology memory-augmented ASR correction that organizes prior interaction history into retrievable nodes and reports gains over direct correction in 9 of 10 backbone-setting pairs on a new long-context dataset.

Benchmarking Speech-to-Speech Translation Models

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

COMPASS is a new reproducible benchmarking framework for S2ST that deploys 46 metrics on 1248 configurations, shows single-metric rankings mislead, reduces to 10 metrics per direction, and finds domain-specific metrics better match human judgments than standalone MOS predictors.

MURMUR: An Efficient Inference System for Long-Form ASR

cs.LG · 2026-05-31 · conditional · novelty 6.0

Murmur matches single-pass long-context ASR accuracy on AMI-IHM while cutting latency 4.2x by tuning chunk size and using intra-chunk attention sparsity via KV eviction.

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

cs.SD · 2026-04-27 · unverdicted · novelty 6.0

Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.

Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs

eess.AS · 2026-04-09 · unverdicted · novelty 6.0

A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.

A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

cs.CV · 2026-04-05 · unverdicted · novelty 6.0

Coordinated multi-modal typographic attacks on MLLMs achieve 83.43% success rate versus 34.93% for single-modality attacks.

Voxtral Realtime

cs.AI · 2026-02-11 · unverdicted · novelty 6.0

Voxtral Realtime is an end-to-end trained streaming ASR model that achieves Whisper-level transcription quality at 480ms delay after scaling pretraining across 13 languages.

MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages

cs.CL · 2025-12-01 · conditional · novelty 6.0

MCAT scales MLLMs to many-to-many speech translation across 70 languages via curriculum learning and a 30-token speech adapter, surpassing prior SOTA on FLEURS while improving speed.

AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

cs.SD · 2025-09-09 · unverdicted · novelty 6.0

AU-Harness introduces an efficient unified evaluation framework for audio LLMs featuring batch optimizations, multi-turn dialogue support, and standardized protocols for fair comparisons.

Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps

cs.CL · 2026-04-21 · unverdicted · novelty 5.0

Four attention metrics enable logistic regression classifiers that detect hallucinations in SpeechLLMs with up to +0.23 PR-AUC gains over baselines on ASR and translation tasks.

Voxtral TTS

cs.AI · 2026-03-26 · unverdicted · novelty 5.0

Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in human evaluations.

When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

cs.SD · 2025-10-01 · unverdicted · novelty 5.0

Irrelevant audio including silence reduces accuracy and increases volatility in text reasoning for large audio-language models, with effects worsening at longer durations, higher amplitudes, and higher temperatures.

Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages

cs.SD · 2025-09-18 · unverdicted · novelty 5.0

Introduces XLSR-Thai encoder, U-Align alignment, and Thai-SUP data pipeline to enable multitask speech understanding SLLMs for Thai.

PlanRAG-Audio: Planning and Retrieval Augmented Generation for Long-form Audio Understanding

eess.AS · 2026-05-19 · unverdicted · novelty 4.0

PlanRAG-Audio introduces planning-based retrieval-augmented generation to improve accuracy and stability of long-form audio understanding in LALMs by decoupling model input from raw audio duration.

Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference

cs.AI · 2026-04-16 · unverdicted · novelty 4.0

A quantized int4 version of Nemotron ASR runs faster than real-time on CPU at 8.20% WER and 0.67 GB size, setting a new efficiency point for on-device streaming speech recognition.

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

eess.AS · 2026-04-20

citing papers explorer

Showing 22 of 22 citing papers.

RealityTest: How People Probe AI Identity and Whether Models Disclose It cs.CL · 2026-05-29 · unverdicted · none · ref 30
RealityTest is a human-grounded multilingual multimodal benchmark showing that only 31% of people ask AI identity directly and that suppression instructions plus question phrasing dominate disclosure behavior over model choice.
CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings cs.AI · 2026-05-17 · unverdicted · none · ref 28
CBT-Audio dataset shows that adding audio input improves distress intensity estimation over transcripts alone for 8 of 10 audio language models, with clearest gains when verbal content and vocal delivery diverge.
Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models eess.AS · 2026-04-28 · unverdicted · none · ref 10
Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.
Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection cs.CR · 2026-04-16 · unverdicted · none · ref 11
AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus cs.CL · 2026-01-14 · unverdicted · none · ref 10
MCGA is a new 119-hour multi-task audio corpus for classical Chinese literary genres that shows current MLLMs face substantial challenges on its test set.
Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs cs.CL · 2025-12-18 · unverdicted · none · ref 57
Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations cs.CL · 2026-06-11 · unverdicted · none · ref 17
Introduces ontology memory-augmented ASR correction that organizes prior interaction history into retrievable nodes and reports gains over direct correction in 9 of 10 backbone-setting pairs on a new long-context dataset.
Benchmarking Speech-to-Speech Translation Models cs.CL · 2026-06-02 · unverdicted · none · ref 58
COMPASS is a new reproducible benchmarking framework for S2ST that deploys 46 metrics on 1248 configurations, shows single-metric rankings mislead, reduces to 10 metrics per direction, and finds domain-specific metrics better match human judgments than standalone MOS predictors.
MURMUR: An Efficient Inference System for Long-Form ASR cs.LG · 2026-05-31 · conditional · none · ref 13
Murmur matches single-pass long-context ASR accuracy on AMI-IHM while cutting latency 4.2x by tuning chunk size and using intra-chunk attention sparsity via KV eviction.
All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation cs.SD · 2026-04-27 · unverdicted · none · ref 17
Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs eess.AS · 2026-04-09 · unverdicted · none · ref 18
A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning cs.CV · 2026-04-05 · unverdicted · none · ref 8
Coordinated multi-modal typographic attacks on MLLMs achieve 83.43% success rate versus 34.93% for single-modality attacks.
Voxtral Realtime cs.AI · 2026-02-11 · unverdicted · none · ref 13
Voxtral Realtime is an end-to-end trained streaming ASR model that achieves Whisper-level transcription quality at 480ms delay after scaling pretraining across 13 languages.
MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages cs.CL · 2025-12-01 · conditional · none · ref 18
MCAT scales MLLMs to many-to-many speech translation across 70 languages via curriculum learning and a 30-token speech adapter, surpassing prior SOTA on FLEURS while improving speed.
AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs cs.SD · 2025-09-09 · unverdicted · none · ref 9
AU-Harness introduces an efficient unified evaluation framework for audio LLMs featuring batch optimizations, multi-turn dialogue support, and standardized protocols for fair comparisons.
Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps cs.CL · 2026-04-21 · unverdicted · none · ref 16
Four attention metrics enable logistic regression classifiers that detect hallucinations in SpeechLLMs with up to +0.23 PR-AUC gains over baselines on ASR and translation tasks.
Voxtral TTS cs.AI · 2026-03-26 · unverdicted · none · ref 11
Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in human evaluations.
When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models cs.SD · 2025-10-01 · unverdicted · none · ref 9
Irrelevant audio including silence reduces accuracy and increases volatility in text reasoning for large audio-language models, with effects worsening at longer durations, higher amplitudes, and higher temperatures.
Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages cs.SD · 2025-09-18 · unverdicted · none · ref 16
Introduces XLSR-Thai encoder, U-Align alignment, and Thai-SUP data pipeline to enable multitask speech understanding SLLMs for Thai.
PlanRAG-Audio: Planning and Retrieval Augmented Generation for Long-form Audio Understanding eess.AS · 2026-05-19 · unverdicted · none · ref 5
PlanRAG-Audio introduces planning-based retrieval-augmented generation to improve accuracy and stability of long-form audio understanding in LALMs by decoupling model input from raw audio duration.
Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference cs.AI · 2026-04-16 · unverdicted · none · ref 14
A quantized int4 version of Nemotron ASR runs faster than real-time on CPU at 8.20% WER and 0.67 GB size, setting a new efficiency point for on-device streaming speech recognition.
NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR eess.AS · 2026-04-20 · unreviewed · ref 13

arXiv preprint arXiv:2507.13264 , year=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer