VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-11 00:56 UTC · model grok-4.3
The pith
VITA-QinYu is the first end-to-end spoken language model that generates role-playing speech and singing alongside natural conversation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VITA-QinYu adopts a hybrid speech-text paradigm that extends interleaved text-audio modeling with multi-codebook audio tokens. This design supports richer paralinguistic representation while keeping a clear separation between modalities. The model is trained on 15.8K hours of synthesized natural conversation, role-playing, and singing data, allowing it to produce speech that conveys personality, mood, or performance elements.
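For scale, a back-of-envelope audio-token budget implied by the reported figures; the 8 × 12.5 Hz codebook configuration is taken from the passage quoted in the Lean-theorem section below, and the rest is simple arithmetic rather than anything stated in the paper:

```python
# Rough audio-token budget implied by the reported numbers (not from the paper).
hours = 15_800                # synthesized training data, per the abstract
tokens_per_second = 8 * 12.5  # eight 12.5 Hz codebooks = 100 audio tokens/s
total_tokens = hours * 3600 * tokens_per_second
print(f"~{total_tokens / 1e9:.1f}B audio tokens")  # ~5.7B
```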
What carries the argument
A hybrid speech-text paradigm with multi-codebook audio tokens, which separates linguistic content from paralinguistic features such as tone and expression.
If this is right
- Speech output can now carry personality and mood for role-playing tasks inside one model.
- Singing becomes possible in the same end-to-end system used for ordinary conversation.
- Conversational accuracy and fluency remain at or above prior state-of-the-art levels.
- Full-duplex streaming interaction becomes available for expressive spoken exchanges.
Where Pith is reading between the lines
- The same separation of text and audio codebooks could be tested on other paralinguistic tasks such as emotional or accented speech.
- Open-sourced streaming support may allow quick integration into dialogue systems that need both information and performance.
- Scaling the synthetic data pipeline to narrower domains like storytelling or education could produce specialized expressive voices.
Load-bearing premise
The synthetic data pipeline produces training examples whose expressiveness and distribution closely match real human role-playing and singing speech.
What would settle it
Human listening tests on real, non-synthetic role-playing and singing recordings in which VITA-QinYu scores lower than a baseline spoken language model on naturalness or expressiveness.
Original abstract
Human speech conveys expressiveness beyond linguistic content, including personality, mood, or performance elements, such as a comforting tone or humming a song, which we formalize as role-playing and singing. We present VITA-QinYu, the first expressive end-to-end (E2E) spoken language model (SLM) that goes beyond natural conversation to support both role-playing and singing generation. VITA-QinYu adopts a hybrid speech-text paradigm that extends interleaved text-audio modeling with multi-codebook audio tokens, a design enabling richer paralinguistic representation while preserving a clear separation between modalities to avoid interference. We further develop a comprehensive data generation pipeline to synthesize a total of 15.8K hours of natural conversation, role-playing, and singing data for training. VITA-QinYu demonstrates superior expressiveness, outperforming peer SLMs by 7 percentage points on objective role-playing benchmarks, and surpassing peer models by 0.13 points on a 5-point MOS scale for singing. Simultaneously, it achieves state-of-the-art conversational accuracy and fluency, exceeding prior SLMs by 1.38 and 4.98 percentage points on the C3 and URO benchmarks, respectively. We open-source our code and models and provide an easy-to-use demo with full-stack support for streaming and full-duplex interaction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VITA-QinYu as the first expressive end-to-end spoken language model (SLM) supporting role-playing and singing in addition to natural conversation. It employs a hybrid interleaved text-audio modeling approach with multi-codebook audio tokens to enable richer paralinguistic representations while avoiding modality interference. A custom data generation pipeline is used to synthesize 15.8K hours of training data covering natural conversation, role-playing, and singing. The model is reported to outperform peer SLMs by 7 percentage points on objective role-playing benchmarks, by 0.13 points on a 5-point MOS scale for singing, and to achieve state-of-the-art results on the C3 (+1.38 pp) and URO (+4.98 pp) conversational benchmarks. Code, models, and a streaming demo are open-sourced.
Significance. If the central claims hold after addressing the evaluation gaps, this would represent a meaningful advance in spoken language modeling by extending capabilities beyond standard conversation to expressive tasks like role-playing and singing. The hybrid multi-codebook architecture and large-scale synthetic data pipeline are potentially enabling contributions for paralinguistic modeling. The open-sourcing of code and models strengthens reproducibility and could accelerate follow-on work in interactive AI applications.
major comments (2)
- [Method / Data Generation Pipeline] Data generation pipeline (described in the method section): The paper states that all training uses 15.8K hours of synthetically generated data but provides no quantitative validation of the pipeline's fidelity, such as human perceptual ratings, acoustic feature histograms (e.g., prosody, pitch, timbre distributions), or controlled real-vs-synthetic benchmark splits. This is load-bearing for the central claims because the reported 7 pp role-playing gain, 0.13 MOS singing improvement, and conversational SOTA results could arise from synthetic data artifacts or distribution shifts rather than the hybrid architecture.
- [Experiments / Results] Experimental evaluation (§4 / Results): Benchmark improvements are presented (7 pp on role-playing, +1.38 pp C3, +4.98 pp URO, 0.13 MOS) without details on evaluation protocols, statistical significance testing, data splits, inter-rater reliability for MOS, error bars, or ablations isolating the multi-codebook design from data effects. This prevents assessment of whether the gains are robust or confounded by the synthetic training distribution.
minor comments (2)
- [Abstract] The abstract refers to 'peer SLMs' and 'prior SLMs' without naming the specific baselines or providing citations; these should be explicitly listed with references in §4 and Table 1 or equivalent.
- [Method] Notation for the multi-codebook audio tokens and hybrid interleaving scheme could be clarified with a diagram or pseudocode in the method section to improve reproducibility.
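To make the pseudocode request concrete, here is a minimal sketch of one way a hybrid interleaved text-audio stream with multi-codebook frames could be laid out. The chunk sizes, vocabulary offsets, and per-codebook ID ranges are illustrative assumptions, not VITA-QinYu's actual scheme; only the eight-codebook count comes from the paper.

```python
# Hypothetical illustration of hybrid text-audio interleaving with
# multi-codebook audio tokens. Names, offsets, and chunk sizes are
# assumptions for exposition, not the paper's actual configuration.
from typing import List

NUM_CODEBOOKS = 8      # e.g., eight 12.5 Hz codebooks (100 tokens/s total)
TEXT_CHUNK = 4         # text tokens emitted per interleaving step (assumed)
AUDIO_CHUNK = 2        # audio frames emitted per interleaving step (assumed)
AUDIO_OFFSET = 50_000  # keeps audio IDs disjoint from text IDs (assumed)

def flatten_audio_frame(frame: List[int]) -> List[int]:
    """Map one multi-codebook frame (one ID per codebook) into a shared
    vocabulary, giving each codebook its own disjoint ID range."""
    assert len(frame) == NUM_CODEBOOKS
    return [AUDIO_OFFSET + k * 1024 + tok for k, tok in enumerate(frame)]

def interleave(text_ids: List[int], audio_frames: List[List[int]]) -> List[int]:
    """Alternate fixed-size chunks of text tokens and flattened audio frames,
    so one autoregressive stream carries both modalities."""
    out, t, a = [], 0, 0
    while t < len(text_ids) or a < len(audio_frames):
        out.extend(text_ids[t:t + TEXT_CHUNK]); t += TEXT_CHUNK
        for frame in audio_frames[a:a + AUDIO_CHUNK]:
            out.extend(flatten_audio_frame(frame))
        a += AUDIO_CHUNK
    return out

# Example: 8 text tokens interleaved with 3 audio frames of 8 codebook IDs each.
seq = interleave(list(range(8)), [[7] * NUM_CODEBOOKS for _ in range(3)])
print(len(seq))  # 8 text tokens + 3 frames * 8 codebook tokens = 32
```

The design point the paper emphasizes, modality separation, appears here as disjoint ID ranges for text and for each codebook, so the two streams never collide in the shared vocabulary.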
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The concerns regarding validation of the synthetic data pipeline and transparency in experimental reporting are well-taken. We address each point below and will revise the manuscript to incorporate additional analyses and details for improved rigor.
Point-by-point responses
Referee: [Method / Data Generation Pipeline] Data generation pipeline (described in the method section): The paper states that all training uses 15.8K hours of synthetically generated data but provides no quantitative validation of the pipeline's fidelity, such as human perceptual ratings, acoustic feature histograms (e.g., prosody, pitch, timbre distributions), or controlled real-vs-synthetic benchmark splits. This is load-bearing for the central claims because the reported 7 pp role-playing gain, 0.13 MOS singing improvement, and conversational SOTA results could arise from synthetic data artifacts or distribution shifts rather than the hybrid architecture.
Authors: We agree that explicit quantitative validation of the data pipeline would strengthen the manuscript. The pipeline employs state-of-the-art TTS and voice conversion to generate expressive data, and downstream SOTA results provide indirect evidence of quality. In the revised version, we will add: acoustic feature histograms (pitch, prosody, timbre) comparing synthetic samples to real speech corpora; human perceptual ratings (naturalness and expressiveness) on a held-out subset of generated data; and evaluations on real-world test sets to demonstrate generalization beyond synthetic distributions. These additions will help rule out artifacts as the source of gains. (Revision planned: yes.)
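What such a validation could look like in practice: a minimal sketch of a synthetic-vs-real pitch-histogram comparison, assuming librosa for F0 extraction. The file paths, bin settings, and divergence check are hypothetical; the authors' actual validation pipeline is not described in the paper.

```python
# Hypothetical sketch: compare pitch (F0) distributions of synthetic vs. real
# speech, one of the acoustic-feature histograms the rebuttal proposes.
# Paths and histogram settings are illustrative assumptions.
import numpy as np
import librosa

def f0_values(path: str, sr: int = 16_000) -> np.ndarray:
    """Extract voiced F0 estimates (Hz) from one utterance with pYIN."""
    y, _ = librosa.load(path, sr=sr)
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    return f0[voiced]  # keep only frames judged voiced

def pitch_histogram(paths: List[str], bins: np.ndarray) -> np.ndarray:
    """Pool F0 values across utterances and normalize to a distribution."""
    pooled = np.concatenate([f0_values(p) for p in paths])
    hist, _ = np.histogram(pooled, bins=bins)
    return hist / max(hist.sum(), 1)

from typing import List  # noqa: E402 (kept near use for readability)

bins = np.linspace(60, 600, 55)  # 10 Hz bins over a speech/singing range
real = pitch_histogram(["real_001.wav", "real_002.wav"], bins)    # hypothetical files
synth = pitch_histogram(["synth_001.wav", "synth_002.wav"], bins) # hypothetical files

# Jensen-Shannon divergence between the two normalized histograms:
# low values would indicate the synthetic pitch distribution tracks real speech.
eps = 1e-12
m = 0.5 * (real + synth)
js = 0.5 * np.sum(real * np.log((real + eps) / (m + eps))) \
   + 0.5 * np.sum(synth * np.log((synth + eps) / (m + eps)))
print(f"JS divergence (nats): {js:.4f}")
```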
Referee: [Experiments / Results] Experimental evaluation (§4 / Results): Benchmark improvements are presented (7 pp on role-playing, +1.38 pp C3, +4.98 pp URO, 0.13 MOS) without details on evaluation protocols, statistical significance testing, data splits, inter-rater reliability for MOS, error bars, or ablations isolating the multi-codebook design from data effects. This prevents assessment of whether the gains are robust or confounded by the synthetic training distribution.
Authors: We acknowledge the need for greater transparency in reporting. Section 4 describes the benchmarks and protocols, but we will expand it in revision to include: full details on data splits and evaluation procedures; statistical significance testing with p-values and confidence intervals for all reported improvements; error bars on objective metrics; inter-rater reliability (e.g., Krippendorff's alpha) for MOS scores; and ablations comparing multi-codebook vs. single-codebook variants trained on identical data to isolate architectural contributions from data effects. This should clarify the robustness of the results. (Revision planned: yes.)
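One concrete form the promised statistics could take: a numpy-only paired-bootstrap confidence interval for a MOS gap, with fabricated placeholder ratings. Whether the reported +0.13 MOS gain clears such a test depends entirely on the study's real listening data; inter-rater reliability such as Krippendorff's alpha would additionally need a dedicated implementation.

```python
# Hypothetical sketch: paired bootstrap CI for a MOS difference between two
# systems rated by the same listeners on the same items. Ratings below are
# fabricated placeholders; the real analysis would use the study's data.
import numpy as np

rng = np.random.default_rng(0)

# Shape (n_items,): per-item mean opinion scores for each system (assumed data).
mos_ours = rng.normal(4.05, 0.3, size=200).clip(1, 5)
mos_base = rng.normal(3.92, 0.3, size=200).clip(1, 5)

def paired_bootstrap_ci(a, b, n_boot=10_000, alpha=0.05, rng=rng):
    """Resample items with replacement and return a CI for mean(a - b)."""
    diffs = a - b
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), lo, hi

mean_gap, lo, hi = paired_bootstrap_ci(mos_ours, mos_base)
print(f"MOS gap = {mean_gap:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
# The reported +0.13 MOS gain would be credible if such a CI excludes zero.
```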
Circularity Check
No circularity: empirical SLM with external benchmark results
full rationale
The paper proposes a hybrid multi-codebook architecture and a synthetic data pipeline, then reports performance deltas on independent benchmarks (role-playing, MOS singing, C3, URO). No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The central claims rest on experimental outcomes measured against external test sets rather than tautological renaming or input-output equivalence.
Axiom & Free-Parameter Ledger
free parameters (2)
- multi-codebook audio token configuration
- data synthesis hyperparameters
axioms (2)
- Domain assumption: Hybrid interleaved text-audio modeling with multi-codebook tokens preserves clear modality separation without interference.
- Domain assumption: The synthesized data distribution matches real human expressiveness for role-playing and singing.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · tagged unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "VITA-QinYu adopts a hybrid speech-text paradigm that extends interleaved text-audio modeling with multi-codebook audio tokens... eight 12.5 Hz codebooks (100 Hz total)"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "We further develop a comprehensive data generation pipeline to synthesize a total of 15.8K hours of natural conversation, role-playing, and singing data"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.