arXiv preprint arXiv:2305.15255

· 2023 · arXiv 2305.15255

7 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

FacePlex: Full-Duplex Joint Speech-Facial Motion Generation for Conversational Avatars

cs.AI · 2026-06-29 · unverdicted · novelty 6.0

FacePlex introduces a unified streaming model with Rolling Flow Matching and Rolling Cross-Attention to enable full-duplex joint real-time generation of speech and facial motion tokens.

Audio Interaction Model

cs.SD · 2026-06-03 · unverdicted · novelty 6.0

Audio-Interaction unifies offline and online audio tasks into one streaming model via the SoundFlow framework and a new 2.6M-item streaming corpus, enabling real-time instruction following and proactive responses.

Learning When to Think While Listening in Large Audio-Language Models

cs.CL · 2026-05-26 · unverdicted · novelty 6.0

A wait-think-answer controller for LALMs is trained via SFT followed by six-reward DAPO, raising row-weighted accuracy from 67.6% to 70.3% and cutting post-endpoint thinking length by 14% on synthetic spoken QA while remaining functional on real recorded audio.

Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation

eess.AS · 2026-06-10 · unverdicted · novelty 5.0

An empirical sweep finds that a 4.17 Hz frame rate combined with intermediate-layer alignment is optimal for speech QA under a frozen text LLM backbone.

Enhancing Speech Large Language Models through Reinforced Behavior Alignment

cs.CL · 2025-08-25 · unverdicted · novelty 5.0

Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken QA and speech-to-text translation benchmarks.

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

eess.AS · 2026-03-18 · 2 refs

citing papers explorer

Showing 6 of 6 citing papers after filters.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
cs.CL · 2026-05-07 · unverdicted · none · ref 36

FacePlex: Full-Duplex Joint Speech-Facial Motion Generation for Conversational Avatars
cs.AI · 2026-06-29 · unverdicted · none · ref 27

Audio Interaction Model
cs.SD · 2026-06-03 · unverdicted · none · ref 20

Learning When to Think While Listening in Large Audio-Language Models
cs.CL · 2026-05-26 · unverdicted · none · ref 46

Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation
eess.AS · 2026-06-10 · unverdicted · none · ref 46

Enhancing Speech Large Language Models through Reinforced Behavior Alignment
cs.CL · 2025-08-25 · unverdicted · none · ref 38
