DeSRPA introduces a dual-level control vector method for inference-time intervention on frozen backbones to improve personality consistency and speech naturalness in role-playing agents over end-to-end fine-tuned baselines.
DeSRPA: Decoupled Speech Role-Playing Agent via Inference-Time Intervention
abstract
While Large Language Models (LLMs) have revolutionized text-based role-playing, creating immersive Speech Role-Playing Agents (SRPAs) requires a seamless bridge between cognitive reasoning and paralinguistic nuances. Current SRPAs primarily rely on end-to-end (E2E) fine-tuning. However, this paradigm suffers from poor generalization to unseen characters due to its reliance on role-specific data, while imposing a "modality alignment tax" that degrades intrinsic LLM reasoning capabilities. We propose DeSRPA, an agentic framework for character role play via inference-time intervention on frozen backbones. DeSRPA employs a dual-level control vector mechanism, Internal Cognitive Steering and External Expressive Rendering, to synchronize "mind" and "voice". Experiments on SpeechRole and OmniCharacter benchmarks demonstrate that DeSRPA significantly outperforms E2E baselines in personality and emotional consistency. It achieves high speech naturalness, narrowing the gap with proprietary models like GPT-4o Audio, while remaining a scalable and training-free paradigm.
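The abstract describes steering a frozen backbone with control vectors at inference time but does not spell out the mechanics. As a rough illustration only, the sketch below shows one widely used generic recipe for activation steering: take the difference of mean activations between two contrasting sets of examples, normalize it, and add a scaled copy to a hidden state while leaving all weights untouched. This is an assumption about the general family of techniques, not DeSRPA's actual procedure; the function names, the toy activations, the dimension `DIM`, and the strength `alpha` are all hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def control_vector(pos_acts, neg_acts):
    """Unit-norm difference of mean activations (generic recipe, an assumption)."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def steer(hidden, vector, alpha):
    """Shift one hidden state along the control vector; no weights are updated."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
DIM = 8  # hypothetical hidden size, kept tiny for the toy example


def toy_activation(shift):
    """Stand-in for a backbone activation: noise plus a trait offset on axis 0."""
    act = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    act[0] += shift
    return act


# Contrasting example sets, e.g. in-character vs. out-of-character prompts.
pos_acts = [toy_activation(+2.0) for _ in range(32)]
neg_acts = [toy_activation(-2.0) for _ in range(32)]

vector = control_vector(pos_acts, neg_acts)
hidden = [random.gauss(0.0, 1.0) for _ in range(DIM)]
steered = steer(hidden, vector, alpha=3.0)
```

Because the vector has unit norm, `alpha` directly sets how far the hidden state moves along the trait direction. In a real system the shift would be applied to an intermediate layer of the frozen model during the forward pass rather than to a toy list.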
fields: cs.SD (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)