Title resolution pending

Alexander H · 2025 · arXiv 2507.13264

26 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 2 · method 1

citation-polarity summary

background 2 · use method 1

representative citing papers

Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

Ouvia is a user-centered evaluation framework for speech translation usability in real-world scenarios, showing limited usability rates and the superiority of QA-based metrics.

SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech

cs.SD · 2026-06-04 · unverdicted · novelty 7.0

The SpeechJBB benchmark shows high jailbreak success rates for LALMs on code-switched harmful audio, highest in non-English cases, with pseudo-word insertion further lowering refusal rates.

RealityTest: How People Probe AI Identity and Whether Models Disclose It

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

RealityTest is a human-grounded multilingual multimodal benchmark showing that only 31% of people ask AI identity directly and that suppression instructions plus question phrasing dominate disclosure behavior over model choice.

CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

cs.AI · 2026-05-17 · unverdicted · novelty 7.0

The CBT-Audio dataset shows that adding audio input improves distress intensity estimation over transcripts alone for 8 of 10 audio language models, with the clearest gains when verbal content and vocal delivery diverge.

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

eess.AS · 2026-04-28 · unverdicted · novelty 7.0

Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

cs.CR · 2026-04-16 · unverdicted · novelty 7.0

AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.

MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus

cs.CL · 2026-01-14 · unverdicted · novelty 7.0

MCGA is a new 119-hour multi-task audio corpus for classical Chinese literary genres that shows current MLLMs face substantial challenges on its test set.

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

cs.CL · 2025-12-18 · unverdicted · novelty 7.0

Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

cs.CL · 2026-06-11 · unverdicted · novelty 6.0

Introduces ontology memory-augmented ASR correction that organizes prior interaction history into retrievable nodes and reports gains over direct correction in 9 of 10 backbone-setting pairs on a new long-context dataset.

Benchmarking Speech-to-Speech Translation Models

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

COMPASS is a new reproducible benchmarking framework for S2ST that deploys 46 metrics on 1248 configurations, shows single-metric rankings mislead, reduces to 10 metrics per direction, and finds domain-specific metrics better match human judgments than standalone MOS predictors.

MURMUR: An Efficient Inference System for Long-Form ASR

cs.LG · 2026-05-31 · conditional · novelty 6.0

Murmur matches single-pass long-context ASR accuracy on AMI-IHM while cutting latency 4.2x by tuning chunk size and using intra-chunk attention sparsity via KV eviction.

All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation

cs.SD · 2026-04-27 · unverdicted · novelty 6.0

Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.

Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs

eess.AS · 2026-04-09 · unverdicted · novelty 6.0

A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.

A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

cs.CV · 2026-04-05 · unverdicted · novelty 6.0

Coordinated multi-modal typographic attacks on MLLMs achieve 83.43% success rate versus 34.93% for single-modality attacks.

Voxtral Realtime

cs.AI · 2026-02-11 · unverdicted · novelty 6.0

Voxtral Realtime is an end-to-end trained streaming ASR model that achieves Whisper-level transcription quality at 480ms delay after scaling pretraining across 13 languages.

MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages

cs.CL · 2025-12-01 · conditional · novelty 6.0

MCAT scales MLLMs to many-to-many speech translation across 70 languages via curriculum learning and a 30-token speech adapter, surpassing prior SOTA on FLEURS while improving speed.

AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

cs.SD · 2025-09-09 · unverdicted · novelty 6.0

AU-Harness introduces an efficient unified evaluation framework for audio LLMs featuring batch optimizations, multi-turn dialogue support, and standardized protocols for fair comparisons.

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

cs.CL · 2026-06-06 · unverdicted · novelty 5.0

GlobeAudio is a new multilingual multicultural benchmark for naturalistic evaluation of large audio-language models, showing performance gaps especially for open-source models and low-resource languages.

Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps

cs.CL · 2026-04-21 · unverdicted · novelty 5.0

Four attention metrics enable logistic regression classifiers that detect hallucinations in SpeechLLMs with up to +0.23 PR-AUC gains over baselines on ASR and translation tasks.

Voxtral TTS

cs.AI · 2026-03-26 · unverdicted · novelty 5.0

Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in human evaluations.

When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

cs.SD · 2025-10-01 · unverdicted · novelty 5.0

Irrelevant audio including silence reduces accuracy and increases volatility in text reasoning for large audio-language models, with effects worsening at longer durations, higher amplitudes, and higher temperatures.

Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages

cs.SD · 2025-09-18 · unverdicted · novelty 5.0

Introduces the XLSR-Thai encoder, U-Align alignment, and the Thai-SUP data pipeline to enable multitask speech-understanding SLLMs for Thai.

FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition

cs.CL · 2026-06-04 · unverdicted · novelty 4.0

FiLM speaker conditioning allows a SpeechLLM to adapt to pathological speakers competitively with fine-tuning while keeping general performance.

PlanRAG-Audio: Planning and Retrieval Augmented Generation for Long-form Audio Understanding

eess.AS · 2026-05-19 · unverdicted · novelty 4.0

PlanRAG-Audio introduces planning-based retrieval-augmented generation to improve accuracy and stability of long-form audio understanding in LALMs by decoupling model input from raw audio duration.
