
arxiv: 2605.20356 · v1 · pith:OY4ZZWTCnew · submitted 2026-05-19 · 💻 cs.CL · cs.AI · cs.SD

Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models

Pith reviewed 2026-05-21 07:40 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.SD
keywords full-duplex dialogue · representational synchronization · turn-taking prediction · speech dialogue models · internal state alignment · anticipatory cues · centered kernel alignment

The pith

Full-duplex speech dialogue models synchronize internal representations near zero lag and encode anticipatory turn-taking information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how full-duplex spoken dialogue models coordinate while listening and speaking simultaneously, by simulating conversations between two instances of the same pretrained model. It measures representational alignment across time lags to track how the two agents' internal states match up, and trains causal probes on delayed states to check for early signals that a speaker is about to stop. Alignment peaks near zero lag when the channel is noise-free and falls off as noise is added, and turn switches can be predicted in advance from the internal states. A sympathetic reader would care because this points to an automatic coordination process inside the models that could produce more natural, overlapping speech patterns instead of rigid back-and-forth exchanges. If the claim holds, designers could rely on these built-in dynamics rather than adding separate rules for turn management.

Core claim

The paper shows that representational synchronization, measured across temporal lags, reaches its highest point near zero lag in the absence of noise and weakens as channel noise is introduced. Internal states from both speaker and listener perspectives contain anticipatory information sufficient for predicting turn-taking events ahead of time.

What carries the argument

Centered Kernel Alignment applied to model activations at varying time lags to quantify synchronization, combined with causal LSTM probes on delayed activations to extract turn-taking predictions.
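
The lagged alignment measurement can be illustrated with a minimal NumPy sketch. It uses the standard linear CKA formula; the function names `linear_cka` and `lagged_cka`, the (frames, features) activation layout, and the sign convention for the lag are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two (frames, features) activation matrices."""
    # Center every feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def lagged_cka(a: np.ndarray, b: np.ndarray, max_lag: int) -> dict:
    """CKA between a[t] and b[t + lag] for lag in [-max_lag, max_lag].

    Assumed convention: a positive lag means agent B trails agent A.
    """
    n = a.shape[0]
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xa, xb = a[: n - lag], b[lag:]
        else:
            xa, xb = a[-lag:], b[: n + lag]
        scores[lag] = linear_cka(xa, xb)
    return scores
```

Under this convention, a curve that peaks at lag 0 corresponds to the near-zero-lag synchronization the paper reports; a peak at a nonzero lag would instead suggest one agent's states following the other's.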

If this is right

  • Synchronization is strongest at zero lag and degrades steadily with added noise.
  • Internal states support turn-taking forecasts made before the actual change occurs.
  • The anticipatory signals appear from both the current speaker and listener viewpoints.
  • Changes in decoding bias alter the interaction patterns in addition to noise effects.
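
To make "forecasts made before the actual change occurs" concrete, here is a minimal sketch of how delayed inputs for a causal probe could be paired with frame-level labels, together with a pairwise AUC-ROC score. The paper's probes are causal LSTMs; the probe model itself is omitted here, and the helper names `delayed_pairs` and `auc_roc` are hypothetical.

```python
import numpy as np


def delayed_pairs(acts: np.ndarray, labels: np.ndarray, delay: int):
    """Pair the activation at frame t - delay with the label at frame t.

    A non-negative delay keeps the probe causal: it only ever sees states
    recorded `delay` frames before the event it has to predict.
    """
    n = acts.shape[0]
    if delay < 0 or delay >= n:
        raise ValueError("delay must satisfy 0 <= delay < number of frames")
    return acts[: n - delay], labels[delay:]


def auc_roc(scores: np.ndarray, labels: np.ndarray) -> float:
    """AUC-ROC as the probability that a positive outranks a negative."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(wins + 0.5 * ties)
```

Sweeping `delay` and plotting the resulting AUC-ROC would give a curve of the kind shown in Figure 3; values that stay above 0.5 at larger delays would indicate anticipatory information in the states.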

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pattern generalizes, synchronization strength could become an additional training signal for improving dialogue naturalness.
  • Real-time monitoring of state alignment might allow systems to detect and correct coordination failures during live use.
  • The same anticipatory encoding could be tested in multi-agent setups that mix speech with other input types.

Load-bearing premise

Dialogues simulated between two identical copies of the same pretrained model under controlled noise and bias conditions reflect the coordination that would appear in real human-AI or multi-agent full-duplex exchanges.
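
One way such a controlled noise condition could be set up is sketched below. This summary does not specify the paper's noise model, so additive white Gaussian noise at a target signal-to-noise ratio is purely an assumption, as is the function name `add_channel_noise`.

```python
import numpy as np


def add_channel_noise(wave: np.ndarray, snr_db: float,
                      rng: np.random.Generator) -> np.ndarray:
    """Add white Gaussian noise to a waveform at the requested SNR in dB.

    Assumption: the channel is modeled as additive Gaussian noise; the
    paper may use a different corruption.
    """
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise
```

Applying the function to the audio passed from one agent to the other, at several SNR levels, would give a graded noise axis along which synchronization can be compared.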

What would settle it

Repeating the centered kernel alignment analysis and LSTM probes on dialogues that include actual human speech input or a different full-duplex model and observing no zero-lag synchronization peak or no above-chance advance turn prediction.

Figures

Figures reproduced from arXiv: 2605.20356 by Cristina Kuo, Marcelo Sancinetti, Pablo Brusco, Pablo Riera, S.R.K. Branavan.

Figure 1. Target locations for the two turn-taking prediction tasks. End-of-IPU prediction is a continuous frame-level task, with positive labels only at true IPU-final frames (triangles). Hold vs. Non-Hold prediction occurs at discrete IPU boundaries, classifying whether the speaker will continue (Hold, shown as a cross) or transition (Non-Hold, shown as a circle).
Figure 2. Linear CKA between the internal representations of the two agents for different lags and communication channel noise. Higher CKA values indicate stronger synchronization. Shaded regions represent 95% confidence intervals of the mean CKA value.
Figure 3. End-of-IPU prediction performance (AUC-ROC) across temporal delays and experimental conditions. Left and right panels correspond to production and perception settings, respectively. Top and bottom panels show noise vs. no-noise conditions and model type. Error bars represent 95% confidence intervals.
Original abstract

Full-duplex spoken dialogue models (SDMs) can listen and speak simultaneously, enabling interaction dynamics closer to human conversation than turn-based systems. Inspired by neural coupling in human communication, we study how such models coordinate their internal representations during interaction. We simulate full-duplex dialogues between two instances of the pretrained Moshi model under controlled conditions, manipulating channel noise and decoding bias. Synchronization is measured using Centered Kernel Alignment (CKA) across temporal lags, while anticipatory turn-taking cues are probed from delayed internal activations using causal LSTM models, from both speaker and listener perspectives. We find strong representational synchronization under no-noise conditions, peaking near zero lag and degrading with noise, and we show that internal states encode anticipatory information that supports turn-taking prediction ahead of time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that full-duplex spoken dialogue models exhibit strong representational synchronization during simulated interactions between two instances of the pretrained Moshi model. Synchronization is quantified via Centered Kernel Alignment (CKA) across temporal lags under controlled channel noise and decoding bias; it peaks near zero lag with no noise and degrades as noise increases. Internal states are further shown to encode anticipatory information that supports turn-taking prediction ahead of time, as probed by causal LSTM models from both speaker and listener perspectives.

Significance. If the central measurements hold after appropriate controls, the work would offer an initial empirical window into internal coordination dynamics in full-duplex SDMs, paralleling neural coupling observations in human dialogue. The controlled noise/bias manipulations and dual-perspective probing constitute a concrete experimental protocol that could be extended to other models or real human-AI settings.

major comments (2)
  1. [Simulation Protocol] The simulation protocol (described in the methods) uses two identical copies of the same pretrained Moshi model. Because the agents share architecture, weights, and training data, the reported high CKA alignment (peaking near zero lag) could arise simply from both models processing correlated acoustic input or producing similar token distributions, rather than from any interaction-specific synchronization mechanism. No control condition with architecturally distinct models or non-interacting baselines is mentioned, so it is unclear whether the observed degradation with noise reflects loss of coordination or merely reduced input similarity.
  2. [Abstract and Results] The abstract states clear findings of 'strong representational synchronization' and anticipatory encoding, yet the provided text supplies no quantitative results (exact CKA values, number of dialogues, error bars, or statistical tests). Without these details it is impossible to assess the magnitude, reliability, or robustness of the central claims.
minor comments (2)
  1. [Methods] Clarify the precise definition and implementation of the causal LSTM probes (input features, lag ranges, training procedure) so that the anticipatory-turn-taking results can be reproduced.
  2. [Figures] Add explicit baseline CKA curves (e.g., shuffled dialogues or independent non-dialogue audio) to the lag plots so readers can judge how much of the zero-lag peak is attributable to dialogue dynamics versus shared input statistics.
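
A shuffled baseline of the kind requested in the second minor comment could be sketched as follows. This is an illustrative permutation null under stated assumptions, not the authors' procedure: it shuffles the frame order of one agent's activations and recomputes linear CKA. The names `linear_cka` and `shuffled_cka_baseline` are hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two (frames, features) activation matrices."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    return float(cross / (np.linalg.norm(x.T @ x, ord="fro")
                          * np.linalg.norm(y.T @ y, ord="fro")))


def shuffled_cka_baseline(a: np.ndarray, b: np.ndarray, n_shuffles: int,
                          rng: np.random.Generator) -> np.ndarray:
    """Null distribution of CKA after breaking the frame alignment of b."""
    scores = np.empty(n_shuffles)
    for i in range(n_shuffles):
        perm = rng.permutation(b.shape[0])  # random reordering of frames
        scores[i] = linear_cka(a, b[perm])
    return scores
```

Frame shuffling also destroys temporal autocorrelation, so it gives a lenient null. Pairing agent A with agent B from a different dialogue, or applying a large circular shift, would be stricter alternatives that preserve each stream's internal dynamics.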

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and outline planned revisions to strengthen the manuscript.

  1. Referee: [Simulation Protocol] The simulation protocol (described in the methods) uses two identical copies of the same pretrained Moshi model. Because the agents share architecture, weights, and training data, the reported high CKA alignment (peaking near zero lag) could arise simply from both models processing correlated acoustic input or producing similar token distributions, rather than from any interaction-specific synchronization mechanism. No control condition with architecturally distinct models or non-interacting baselines is mentioned, so it is unclear whether the observed degradation with noise reflects loss of coordination or merely reduced input similarity.

    Authors: We agree that identical model instances introduce a potential confound from shared weights and architecture. The observed CKA degradation specifically with channel noise (rather than uniform reduction) provides evidence that synchronization depends on the quality of the exchanged signals during interaction. Nevertheless, to isolate interaction-specific coordination from input similarity, we will add a non-interacting baseline in the revised methods and results, in which the two models process independent audio streams without any channel exchange. We will also explicitly discuss the use of identical instances as a limitation and note that future work should test architecturally distinct models. revision: yes

  2. Referee: [Abstract and Results] The abstract states clear findings of 'strong representational synchronization' and anticipatory encoding, yet the provided text supplies no quantitative results (exact CKA values, number of dialogues, error bars, or statistical tests). Without these details it is impossible to assess the magnitude, reliability, or robustness of the central claims.

    Authors: The results section reports the quantitative details, including CKA values across lags and noise levels, the number of simulated dialogues, error bars from repeated runs, and associated statistical tests. To make the central claims more immediately assessable, we will revise the abstract to include key quantitative highlights (e.g., peak CKA value under no-noise conditions and number of dialogues) while preserving brevity. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical measurements


The paper reports an empirical study of representational synchronization in simulated full-duplex dialogues using two instances of the pretrained Moshi model, with synchronization quantified via CKA across lags and anticipatory cues probed via causal LSTMs. No mathematical derivation, first-principles prediction, or parameter-fitting step is described that reduces a claimed result to its own inputs by construction. The measurements are direct observations under manipulated noise and bias conditions rather than self-definitional or fitted-input predictions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to justify a derivation. The work is therefore self-contained as experimental reporting without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are stated in the abstract; the work relies on the pretrained Moshi model and standard metrics.



Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 3 internal anchors

  1. [1]

    Introduction. Human conversation is characterized by rich temporal dynamics in which participants coordinate timing, mirror prosody, and anticipate turn changes [1, 2]. These phenomena include what is referred to as entrainment, the tendency of interlocutors to converge in speech rate, pitch, and syntactic structure, although they may also adapt by becomi...

  2. [2]

    Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models

    Related Work. Spoken Dialogue Systems (SDS) have shifted from rigid, turn-based half-duplex architectures toward synchronous, full-duplex interactions that more closely resemble human communication. Traditional cascaded pipelines for speech recognition, language modeling, and synthesis suffer from latency and unnatural turn-taking [8, 9]. To address t...

  3. [3]

    Moshika” for the agent role and “Moshiko

    Methods. 3.1. Simulated Full-Duplex Dialogues Environment. To study synchronization between internal representations and turn-taking dynamics under controlled conditions, we generate conversations between two independent instances of Moshi of 100 seconds. As the model can generate speech while consuming audio, we connected two instances (A and B) through a...

  4. [4]

    All experiments are conducted in the simulated two-agent environment described in Section 3

    Experiment Results. We design two sets of experiments to investigate synchronization and turn-taking dynamics. All experiments are conducted in the simulated two-agent environment described in Section 3. First we evaluate the emergence of synchronization under different communication conditions: (i) varying communication channel noise levels, (ii) ch...

  5. [5]

    interactional health

    Discussion and Future Work. Our findings demonstrate that internal representational synchronization in full-duplex models emerges as a computational analog to human neural coupling [5], showing high sensitivity to communication channel integrity. The alignment captured via CKA, combined with the probes’ ability to forecast turn transitions with signi...

  6. [6]

    On the structure of speaker–auditor interaction during speaking turns,

    S. Duncan, “On the structure of speaker–auditor interaction during speaking turns,” Language in Society, vol. 3, no. 2, pp. 161–180, 1974

  7. [7]

    Universals and cultural variation in turn-taking in conversation,

    T. Stivers, N. J. Enfield, P. Brown, C. Englert, M. Hayashi, T. Heinemann, G. Hoymann, F. Rossano, J. P. De Ruiter, K.-E. Yoon et al., “Universals and cultural variation in turn-taking in conversation,” Proceedings of the National Academy of Sciences, vol. 106, no. 26, pp. 10587–10592, 2009

  8. [8]

    An empirical study of the effect of acoustic-prosodic entrainment on the perceived trustworthiness of conversational avatars,

    R. H. Gálvez, A. Gravano, Š. Beňuš, R. Levitan, M. Trnka, and J. Hirschberg, “An empirical study of the effect of acoustic-prosodic entrainment on the perceived trustworthiness of conversational avatars,” Speech Communication, vol. 124, pp. 46–67, 2020

  9. [9]

    Turn-taking in conversational systems and human-robot interaction: a review,

    G. Skantze, “Turn-taking in conversational systems and human-robot interaction: a review,” Computer Speech & Language, vol. 67, p. 101178, 2021

  10. [10]

    Speaker–listener neural coupling underlies successful communication,

    G. J. Stephens, L. J. Silbert, and U. Hasson, “Speaker–listener neural coupling underlies successful communication,” Proceedings of the National Academy of Sciences, vol. 107, no. 32, pp. 14425–14430, 2010

  11. [11]

    Time-delayed mutual information of the phase as a measure of functional connectivity,

    A. Wilmer, M. de Lussanet, and M. Lappe, “Time-delayed mutual information of the phase as a measure of functional connectivity,” 2012

  12. [12]

    Automatic offline annotation of turn-taking transitions in task-oriented dialogue,

    P. Brusco and A. Gravano, “Automatic offline annotation of turn-taking transitions in task-oriented dialogue,” Computer Speech & Language, vol. 78, p. 101462, 2023

  13. [13]

    Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

    C.-K. Yang, N. Ho, and H.-y. Lee, “Towards holistic evaluation of large audio-language models: A comprehensive survey,” ArXiv, vol. abs/2505.15957, 2025. [Online]. Available: https://api.semanticscholar.org/CorpusID:278789464

  14. [14]

    Fd-bench: A full-duplex benchmarking pipeline designed for full duplex spoken dialogue systems,

    Y. Peng, Y.-W. Chao, D. Ng, Y. Ma, C. Ni, B. Ma, and C. E. Siong, “Fd-bench: A full-duplex benchmarking pipeline designed for full duplex spoken dialogue systems,” ArXiv, vol. abs/2507.19040, 2025. [Online]. Available: https://api.semanticscholar.org/CorpusID:280046112

  15. [15]

    From turn-taking to synchronous dialogue: A survey of full-duplex spoken language models,

    Y. Chen and H. Yu, “From turn-taking to synchronous dialogue: A survey of full-duplex spoken language models,” ArXiv, vol. abs/2509.14515, 2025. [Online]. Available: https://api.semanticscholar.org/CorpusID:281394631

  16. [16]

    Flexi: Benchmarking full-duplex human-llm speech interaction,

    Y. Ge, S. Chen, J. Xiao, X. Liu, T. Xiao, Y. Xiang, Z. Yu, and J. Zhu, “Flexi: Benchmarking full-duplex human-llm speech interaction,” ArXiv, vol. abs/2509.22243, 2025. [Online]. Available: https://api.semanticscholar.org/CorpusID:281659329

  17. [17]

    Generative spoken dialogue language modeling,

    T. A. Nguyen, E. Kharitonov, J. Copet, Y. Adi, W.-N. Hsu, A. Elkahky, P. Tomasello, R. Algayres, B. Sagot, A. Mohamed et al., “Generative spoken dialogue language modeling,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 250–266, 2023

  18. [18]

    Towards general auditory intelligence: Large multimodal models for machine listening and speaking,

    S. Wang, Z. Jin, C. Tang, Q. Li, B. Li, C. Chen, Y. Hu, W. Yu, Y. Li, J. Zhuang, Y. Yang, M. Wang, M. Han, Y. Ding, J. Bai, T. Ouyang, S.-Y. Chang, X. Chen, X. Tian, J. Zhang, L. Lu, G. Sun, Z. Chen, J. Wu, B. Zhou, Y. Wang, T. N. Sainath, Y. Wu, and C. Zhang, “Towards general auditory intelligence: Large multimodal models for machine listening and...

  19. [19]

    Moshi: a speech-text foundation model for real-time dialogue

    A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,” 2024. [Online]. Available: https://arxiv.org/abs/2410.00037

  20. [20]

    Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities,

    G.-T. Lin, J. Lian, T. Li, Q. Wang, G. Anumanchipalli, A. H. Liu, and H.-y. Lee, “Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities,” arXiv preprint arXiv:2503.04721, 2025

  21. [21]

    Talking turns: Benchmarking audio foundation models on turn-taking dynamics,

    S. Arora, Z. Lu, C.-C. Chiu, R. Pang, and S. Watanabe, “Talking turns: Benchmarking audio foundation models on turn-taking dynamics,” CoRR, 2025

  22. [22]

    Similarity of neural network representations revisited,

    S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Similarity of neural network representations revisited,” in International Conference on Machine Learning. PMLR, 2019, pp. 3519–3529