Kame: Tandem architecture for enhancing knowledge in real-time speech-to-speech conversational AI

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Real-time speech-to-speech (S2S) models excel at generating natural, low-latency conversational responses but often lack deep knowledge and semantic understanding. Conversely, cascaded systems combining automatic speech recognition, a text-based Large Language Model (LLM), and text-to-speech synthesis offer superior knowledge representation at the cost of high latency, which disrupts the flow of natural interaction. This paper introduces a novel hybrid architecture that bridges the gap between these two paradigms. Our framework processes user speech through an S2S transformer for immediate responsiveness while concurrently relaying the query to a powerful back-end LLM. The LLM's text-based response is then injected in real time to guide the S2S model's speech generation, effectively infusing its output with rich knowledge without the full latency penalty of a cascaded system. We evaluated our method using a speech-synthesized variant of the MT-Bench benchmark that consists of multi-turn question-answering sessions. The results demonstrate that our system substantially outperforms a baseline S2S model in response correctness, approaching that of a cascaded system, while maintaining a latency on par with the baseline.
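The tandem flow described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: `backend_llm`, `tandem_respond`, the chunk labels, and the tick counts are all hypothetical stand-ins, and cooperative `asyncio` yields stand in for real model latency. The sketch only shows the control flow: the front end starts emitting output immediately, and switches to back-end-guided output as soon as the slower text response becomes available.

```python
import asyncio
from typing import List


async def backend_llm(query: str, ticks: int) -> str:
    """Hypothetical stand-in for the slower back-end text LLM."""
    for _ in range(ticks):
        await asyncio.sleep(0)  # each yield simulates one unit of back-end latency
    return f"text answer to: {query}"


async def tandem_respond(query: str, n_steps: int = 6, backend_ticks: int = 3) -> List[str]:
    """Emit one output chunk per step without waiting for the back end."""
    # Relay the query to the back end concurrently with front-end generation.
    oracle_task = asyncio.create_task(backend_llm(query, backend_ticks))
    chunks: List[str] = []
    for step in range(n_steps):
        await asyncio.sleep(0)  # simulated time to generate one front-end chunk
        if oracle_task.done():
            # Back-end text has arrived: condition this chunk on it.
            chunks.append(f"guided[{step}] using '{oracle_task.result()}'")
        else:
            # No back-end text yet: respond from the front-end model alone.
            chunks.append(f"fast[{step}]")
    if not oracle_task.done():
        oracle_task.cancel()  # the response ended before the back end replied
    return chunks


if __name__ == "__main__":
    for chunk in asyncio.run(tandem_respond("What is the capital of France?")):
        print(chunk)
```

In this toy version the first chunks are produced before the back-end reply exists, which mirrors the low-latency start, and later chunks carry the back-end text, which mirrors the knowledge injection. How the real system conditions speech generation on the injected text is not specified here.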

citation-role summary

background 1

citation-polarity summary

fields: eess.AS 2
years: 2026 2
verdicts: unverdicted 2
roles: background 1
polarities: background 1

representative citing papers

Endpoint Anticipation for Low-Latency Spoken Dialogue

eess.AS · 2026-06-11 · unverdicted · novelty 5.0

A speech-based model forecasts conversation turn endpoints up to 2.56 seconds ahead to enable lower-latency spoken dialogue via speculative LLM and TTS execution.

A Survey of Audio Reasoning in Multimodal Foundation Models

eess.AS · 2026-05-20 · unverdicted · novelty 2.0

A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.

citing papers explorer

Showing 2 of 2 citing papers.

  • Endpoint Anticipation for Low-Latency Spoken Dialogue eess.AS · 2026-06-11 · unverdicted · none · ref 26 · internal anchor


  • A Survey of Audio Reasoning in Multimodal Foundation Models eess.AS · 2026-05-20 · unverdicted · none · ref 88 · internal anchor
