pith. machine review for the scientific record.

arxiv: 2604.12928 · v3 · submitted 2026-04-14 · 💻 cs.CL · eess.AS

Recognition: no theorem link

MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 00:44 UTC · model grok-4.3

classification 💻 cs.CL eess.AS
keywords full-duplex speech models · asynchronous retrieval · knowledge grounding · factuality improvement · conversational AI · speech-to-speech models · modular RAG · real-time interactivity

The pith

MoshiRAG lets full-duplex speech models retrieve external knowledge during response delivery to reach non-duplex factuality levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MoshiRAG as a modular addition to full-duplex speech-to-speech models that identifies knowledge needs and pulls in external facts asynchronously. Full-duplex systems excel at natural turn-taking and interruptions but often fall short on accuracy compared with slower, non-duplex alternatives. By treating the brief interval between starting a reply and stating its key content as retrieval time, the framework grounds answers without introducing noticeable delays or breaking conversational flow. The design works with existing retrieval components in a plug-and-play way and extends to out-of-domain tasks such as mathematical reasoning. If the approach holds, it removes the need to scale model size to improve factuality while preserving the real-time interactivity that defines full-duplex conversation.

Core claim

MoshiRAG combines a compact full-duplex interface with selective, asynchronous retrieval: the model detects when external information is required and completes the lookup in the natural pause between response onset and core-information delivery. This lets it match the factuality of leading non-duplex speech models while retaining full interactivity and supporting plug-and-play retrieval modules without retraining.

What carries the argument

The asynchronous retrieval framework that runs in the temporal gap between spoken response start and delivery of factual content.
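The timing trick can be caricatured in a few lines of asyncio. Everything here is invented for illustration (the retriever, the latencies, the filler words); only the overlap of background retrieval with the response onset reflects the paper's design.

```python
import asyncio

async def retrieve(query: str) -> str:
    # Stand-in for any plug-and-play back-end retriever; 0.3 s simulated latency.
    await asyncio.sleep(0.3)
    return "Paris"

async def respond(query: str) -> list[str]:
    # Launch retrieval in the background the moment the query is detected...
    lookup = asyncio.create_task(retrieve(query))
    spoken = []
    # ...while the model already begins its reply (the "response onset").
    for onset_word in ["Sure,", "the", "capital", "of", "France", "is"]:
        spoken.append(onset_word)
        await asyncio.sleep(0.08)  # simulated per-word speech time
    # By the time the core fact is due, retrieval has (ideally) finished,
    # so awaiting it here does not stall delivery.
    fact = await lookup
    spoken.append(fact + ".")
    return spoken

words = asyncio.run(respond("capital of France"))
print(" ".join(words))
```

The whole premise lives in that final `await`: if retrieval outlasts the onset, the pause becomes audible, which is exactly the failure mode probed below.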

If this is right

  • Compact full-duplex models can ground their outputs in large external knowledge bases without increasing inference cost.
  • Retrieval components can be exchanged or upgraded independently of the speech model.
  • Performance on out-of-domain reasoning tasks such as math improves without additional training.
  • Handling of interruptions and backchannels remains unchanged because retrieval occurs in parallel with speech generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Smaller speech models could close the accuracy gap with much larger ones by offloading knowledge lookup rather than embedding everything internally.
  • The same gap-based timing might support retrieval of longer or multi-hop facts if conversation turns allow slightly longer buffers.
  • Extending the approach to video or multimodal inputs could let real-time agents fetch visual or textual context mid-response.

Load-bearing premise

The natural interval between beginning a spoken reply and needing to state accurate facts is long enough for retrieval to finish without forcing pauses or unnatural delivery.

What would settle it

Record a set of queries that require a specific fact in the first few words of the answer; measure whether MoshiRAG produces accurate speech immediately or introduces delays or errors compared with a non-retrieval baseline.
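A minimal sketch of such a probe, with entirely made-up numbers: per query, log the delay from response onset to the factual token and whether the fact was correct, for both systems.

```python
# Hypothetical probe: (onset-to-fact delay in seconds, fact correct?) per query,
# for MoshiRAG and a no-retrieval baseline. All numbers are invented.

def score(runs: list[tuple[float, bool]]) -> tuple[float, float]:
    mean_delay = sum(d for d, _ in runs) / len(runs)
    accuracy = sum(ok for _, ok in runs) / len(runs)
    return mean_delay, accuracy

moshirag = [(0.42, True), (0.55, True), (0.48, False), (0.51, True)]
baseline = [(0.40, False), (0.41, True), (0.39, False), (0.43, False)]

for name, runs in [("MoshiRAG", moshirag), ("baseline", baseline)]:
    delay, acc = score(runs)
    print(f"{name}: mean onset-to-fact delay {delay:.2f}s, accuracy {acc:.0%}")
```

The claim would be settled if accuracy rises while the onset-to-fact delay stays level with the baseline; higher accuracy bought with longer delays would mean the "natural gap" is being stretched, not exploited.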

Figures

Figures reproduced from arXiv: 2604.12928 by Alexandre Défossez, Chung-Ming Chien, Eugene Kharitonov, Karen Livescu, Manu Orsini, Neil Zeghidour.

Figure 1. Illustration of turn-based models versus full-duplex models. The former must explicitly switch between speaking and listening states, while the latter can concurrently speak and listen.
Figure 2. Specialized fine-tuning data trains Moshi to predict a retrieval trigger signal when the user poses knowledge-intensive queries; this signal asynchronously invokes an information retrieval system to generate reference documents relevant to the conversation context, which are incorporated into the response generation process.
Figure 3. Illustration of the front-end and back-end components in MoshiRAG. When the model needs external information, it outputs a ⟨ret⟩ token; the conversation transcript is sent to the back end, which operates asynchronously. Once ready, the result is injected into Moshi, which adapts its response with no interruption.
Figure 4. Text and audio token streams of the inputs and outputs of MoshiRAG. Front-end Moshi receives at every step its previous-step token predictions and the user speech tokens; when the retrieval result is ready, its representation is summed with the embeddings from the other token streams and ingested over a number of time steps.
Figure 5. Distributions of retrieval delay during training and inference. The broader training-time distribution exposes the model to edge cases that potentially enhance robustness, and inference-time retrieval delays are almost always shorter than the keyword delay, confirming that the timing constraint of Section 3.1 is almost always satisfied.
Figure 6. Analysis of MoshiRAG performance under different speech intelligibility and retrieval delay conditions, including whether the retrieved information is successfully integrated and effectively improves MoshiRAG's final response.
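The injection step the Figure 4 caption describes (the retrieval result's representation summed with the other token-stream embeddings) can be sketched as follows. The summation is from the caption; the embedding width, the random vectors, and the function name are invented.

```python
import numpy as np

D = 8  # toy embedding width; the real model's dimensions are not given here

def step_embedding(token_emb: np.ndarray, retrieval_emb=None) -> np.ndarray:
    # Per the Figure 4 caption: once the retrieval result is ready, its
    # encoded representation is summed with the other token-stream embeddings.
    return token_emb if retrieval_emb is None else token_emb + retrieval_emb

rng = np.random.default_rng(0)
token_emb = rng.normal(size=D)  # stand-in for the combined audio/text streams
ret_emb = rng.normal(size=D)    # stand-in for the encoded retrieval result

before = step_embedding(token_emb)           # back end still working
after = step_embedding(token_emb, ret_emb)   # result injected after ⟨ret⟩
```

Because injection is purely additive, the front end's decoding loop is unchanged when no retrieval result is pending, which is what lets the back end run fully asynchronously.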
read the original abstract

Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes MoshiRAG, a modular asynchronous retrieval-augmented generation framework for full-duplex speech-to-speech language models. It pairs a compact full-duplex interface with selective external retrieval, identifying knowledge-demanding queries and completing retrieval during the natural temporal gap between response onset and core-information delivery. The central claims are that this yields factuality comparable to the best publicly released non-duplex speech models, preserves full-duplex interactivity (interruptions, backchannels), supports plug-and-play retrieval without retraining, and delivers strong out-of-domain performance on mathematical reasoning tasks.

Significance. If the empirical results hold, the work would be significant for conversational AI: it decouples factuality gains from model scaling (which is prohibitive for real-time inference) and offers a modular route to grounding full-duplex systems in external knowledge while retaining their distinctive interactivity properties.

major comments (2)
  1. [Abstract] Abstract: the claim that MoshiRAG 'achieves factuality comparable to the best publicly released non-duplex speech language models' is presented without any metrics, baselines, error bars, or experimental setup, which is load-bearing for the primary contribution.
  2. [Abstract] Abstract / method description: the assertion that retrieval can be completed 'while maintaining a natural conversation flow' by leveraging the 'natural temporal gap' supplies no quantitative characterization of the gap distribution across real dialogues or of end-to-end retrieval latency under the chosen plug-and-play retriever; without these bounds the preservation of full-duplex interactivity remains unverified.
minor comments (1)
  1. [Abstract] The abstract states 'strong performance on out-of-domain mathematical reasoning tasks' yet provides no task definitions, metrics, or comparison tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate revisions made to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that MoshiRAG 'achieves factuality comparable to the best publicly released non-duplex speech language models' is presented without any metrics, baselines, error bars, or experimental setup, which is load-bearing for the primary contribution.

    Authors: The abstract summarizes results detailed in the Experiments section, which includes factuality metrics on relevant benchmarks, comparisons to non-duplex baselines, and error bars from multiple runs. To address the concern directly in the abstract, we have revised it to include a concise quantitative statement of the key factuality result and primary baseline. revision: yes

  2. Referee: [Abstract] Abstract / method description: the assertion that retrieval can be completed 'while maintaining a natural conversation flow' by leveraging the 'natural temporal gap' supplies no quantitative characterization of the gap distribution across real dialogues or of end-to-end retrieval latency under the chosen plug-and-play retriever; without these bounds the preservation of full-duplex interactivity remains unverified.

    Authors: We agree that explicit bounds strengthen the claim. The revised manuscript now includes a characterization of temporal gap distributions drawn from real dialogue corpora (with mean and percentile statistics) alongside measured end-to-end latencies for the plug-and-play retriever, confirming that retrieval completes within available gaps in the large majority of cases. revision: yes
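The measurement the rebuttal promises reduces to comparing two per-turn quantities: the onset-to-keyword gap against the end-to-end retrieval latency. A toy sketch with invented numbers:

```python
# Toy version of the gap-vs-latency check: report the fraction of turns
# where retrieval finishes inside the available gap. All numbers invented.

gaps = [0.9, 1.2, 0.7, 1.5, 0.8, 1.1]       # seconds from onset to first fact
latencies = [0.3, 0.5, 0.8, 0.4, 0.6, 0.2]  # end-to-end retrieval latency

within = sum(lat <= gap for gap, lat in zip(gaps, latencies)) / len(gaps)
print(f"retrieval completes within the gap in {within:.0%} of turns")
```

Reporting this fraction per retriever would also quantify how "plug-and-play" a slower retrieval module really is under the timing constraint.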

Circularity Check

0 steps flagged

No circularity: modular architecture with no derivations or self-referential claims

full rationale

The paper presents MoshiRAG as a modular, plug-and-play system that combines an existing full-duplex model with asynchronous external retrieval, without any equations, parameter fittings, or mathematical derivations. The central claim—that factuality matches non-duplex models while preserving interactivity—is framed as an empirical outcome of leveraging a natural temporal gap in conversation, not as a result derived by construction from the method's own inputs. No steps match the enumerated circularity patterns such as self-definitional relations, fitted inputs renamed as predictions, or load-bearing self-citations; the description remains self-contained as an engineering integration without reducing to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions from speech and retrieval literature plus the paper-specific premise that query detection and timing gaps suffice for background retrieval.

axioms (2)
  • domain assumption Full-duplex models can reliably identify knowledge-demanding queries during ongoing speech.
    Invoked in the description of the asynchronous framework.
  • domain assumption External retrieval can complete within the natural pause before core factual content is spoken.
    Central timing assumption stated in the abstract.

pith-pipeline@v0.9.0 · 5510 in / 1244 out tokens · 43535 ms · 2026-05-13T00:44:15.751837+00:00 · methodology

discussion (0)


    Your response must be only the word “Yes” or “No”. Example 1: Correct Match Question: If a train travels at 60 mph for 2 hours, how far does it travel? Correct Answer: 120.0 Model Answer: The total distance is 120 miles. Response: Yes Example 2: Incorrect Match Question: John had 10 apples and ate 3. How many are left? Correct Answer: 7 Model Answer: He h...