VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
hub
Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition
11 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 11roles
method 2polarities
use method 2representative citing papers
OmniTrace converts token-level signals into span-level cross-modal attributions for open-ended generation in omni-modal LLMs via generation-time tracing.
SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.
FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.
Step-Audio introduces a 130B-parameter unified speech-text model with open-sourced components for understanding, generation, affordable voice cloning, and dynamic control, claiming SOTA human evaluation results on a new benchmark.
LLM system with paralinguistic cue injection and auxiliary reasoning training reaches 0.802 macro F1 and 0.805 accuracy on three-class speech-based crisis level classification under 5-fold cross-validation.
A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.
CLSR is an end-to-end contrastive language-speech retriever using an intermediate text-like conversion step to improve retrieval of relevant segments from long audio for spoken question answering.
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while finding deeper intents.
A two-stage static-then-dynamic prompt selection strategy using prosodic features, LLM coherence scores, and similarity metrics improves emotion intensity and speaker consistency in zero-shot TTS.
citing papers explorer
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
-
OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs
OmniTrace converts token-level signals into span-level cross-modal attributions for open-ended generation in omni-modal LLMs via generation-time tracing.
-
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise
SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.
-
FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection
FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.
-
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Step-Audio introduces a 130B-parameter unified speech-text model with open-sourced components for understanding, generation, affordable voice cloning, and dynamic control, claiming SOTA human evaluation results on a new benchmark.
-
Speech-based Psychological Crisis Assessment using LLMs
LLM system with paralinguistic cue injection and auxiliary reasoning training reaches 0.802 macro F1 and 0.805 accuracy on three-class speech-based crisis level classification under 5-fold cross-validation.
-
Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge
A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.
-
End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering
CLSR is an end-to-end contrastive language-speech retriever using an intermediate text-like conversion step to improve retrieval of relevant segments from long audio for spoken question answering.
-
Kimi-Audio Technical Report
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
-
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while finding deeper intents.
-
Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS
A two-stage static-then-dynamic prompt selection strategy using prosodic features, LLM coherence scores, and similarity metrics improves emotion intensity and speaker consistency in zero-shot TTS.