Vita-audio: Fast interleaved cross-modal token generation for efficient large speech-language model

· 2025 · arXiv 2505.03739

6 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

cs.CR · 2026-04-16 · unverdicted · novelty 7.0

AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.

Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.

LiveGesture: Streamable Co-Speech Gesture Generation Model

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

LiveGesture introduces the first fully streamable zero-lookahead co-speech full-body gesture generation model using a causal vector-quantized tokenizer and hierarchical autoregressive transformers that matches offline SOTA on BEAT2.

DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action

eess.AS · 2026-05-20 · unverdicted · novelty 5.0

DuplexSLA is a dual-stream three-channel full-duplex model that synchronizes continuous user audio, discrete assistant audio, and rate-limited action text for native turn-taking and in-conversation tool calling.

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

cs.SD · 2026-05-18 · unverdicted · novelty 5.0

A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.

citing papers explorer


VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing cs.CL · 2026-05-07 · unverdicted · none · ref 100
Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection cs.CR · 2026-04-16 · unverdicted · none · ref 31
Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning cs.LG · 2026-04-15 · unverdicted · none · ref 20
LiveGesture: Streamable Co-Speech Gesture Generation Model cs.CV · 2026-04-13 · unverdicted · none · ref 26
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action eess.AS · 2026-05-20 · unverdicted · none · ref 27
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook cs.SD · 2026-05-18 · unverdicted · none · ref 131
