hub Canonical reference

Mimo-audio: Audio language models are few-shot learners

· 2025 · arXiv 2512.23808

Canonical reference. 75% of citing Pith papers cite this work as background.

15 Pith papers citing it

Background 75% of classified citations

read on arXiv browse 15 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 baseline 2

citation-polarity summary

background 6 baseline 2

representative citing papers

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

eess.AS · 2026-04-24 · unverdicted · novelty 7.0

LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

cs.CL · 2026-04-22 · unverdicted · novelty 7.0

SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.

CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

cs.SD · 2026-04-09 · unverdicted · novelty 7.0

CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

eess.AS · 2026-04-03 · unverdicted · novelty 7.0

Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.

TiCo: Time-Controllable Spoken Dialogue Model

cs.CL · 2026-03-23 · unverdicted · novelty 7.0

TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models

cs.CR · 2026-05-18 · unverdicted · novelty 6.0

AIA generates universal interference audio infused with Acoustic Latent Semantics to bypass LALM safety alignment, achieving SOTA attack success rates on 10 models across five datasets.

Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model

eess.AS · 2026-05-12 · unverdicted · novelty 6.0

A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.

Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation

eess.AS · 2026-05-09 · unverdicted · novelty 6.0

L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.

TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

cs.CL · 2026-04-24 · unverdicted · novelty 6.0

TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment on a 1,600-sample Mandarin test set while profiling six TTS paradigms.

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

cs.SD · 2026-05-18 · unverdicted · novelty 5.0

A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.

Qwen3.5-Omni Technical Report

cs.CL · 2026-04-17 · unverdicted · novelty 5.0

Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.

Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

eess.AS · 2026-04-14 · unverdicted · novelty 5.0

Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.

Step-Audio-R1.5 Technical Report

eess.AS · 2026-04-28 · unverdicted · novelty 4.0

Step-Audio-R1.5 applies RLHF to audio reasoning models to maintain analytical performance while improving prosodic naturalness and immersion in extended spoken interactions.

citing papers explorer

Showing 15 of 15 citing papers.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where cs.SD · 2026-04-16 · unverdicted · none · ref 33
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing cs.CL · 2026-05-07 · unverdicted · none · ref 101
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding eess.AS · 2026-04-24 · unverdicted · none · ref 40
LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation cs.CL · 2026-04-22 · unverdicted · none · ref 38
SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation cs.SD · 2026-04-09 · unverdicted · none · ref 33
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR eess.AS · 2026-04-03 · unverdicted · none · ref 19
Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.
TiCo: Time-Controllable Spoken Dialogue Model cs.CL · 2026-03-23 · unverdicted · none · ref 73
TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models cs.CR · 2026-05-18 · unverdicted · none · ref 56
AIA generates universal interference audio infused with Acoustic Latent Semantics to bypass LALM safety alignment, achieving SOTA attack success rates on 10 models across five datasets.
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model eess.AS · 2026-05-12 · unverdicted · none · ref 28
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
Reducing Linguistic Hallucination in LM-Based Speech Enhancement via Noise-Invariant Acoustic-Semantic Distillation eess.AS · 2026-05-09 · unverdicted · none · ref 64
L3-SE reduces linguistic hallucination in LM-based speech enhancement by distilling noise-invariant acoustic-semantic representations from noisy inputs to condition an autoregressive decoder-only language model.
TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis cs.CL · 2026-04-24 · unverdicted · none · ref 37
TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment on a 1,600-sample Mandarin test set while profiling six TTS paradigms.
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook cs.SD · 2026-05-18 · unverdicted · none · ref 60
A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
Qwen3.5-Omni Technical Report cs.CL · 2026-04-17 · unverdicted · none · ref 48
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.
Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models eess.AS · 2026-04-14 · unverdicted · none · ref 15
Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
Step-Audio-R1.5 Technical Report eess.AS · 2026-04-28 · unverdicted · none · ref 4
Step-Audio-R1.5 applies RLHF to audio reasoning models to maintain analytical performance while improving prosodic naturalness and immersion in extended spoken interactions.

Mimo-audio: Audio language models are few-shot learners

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer