Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models
Pith reviewed 2026-05-21 07:40 UTC · model grok-4.3
The pith
Full-duplex speech dialogue models synchronize internal representations near zero lag and encode anticipatory turn-taking information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that representational synchronization, measured across temporal lags, reaches its highest point near zero lag in the absence of noise and weakens as channel noise is introduced. Internal states from both speaker and listener perspectives contain anticipatory information sufficient for predicting turn-taking events ahead of time.
What carries the argument
Centered Kernel Alignment applied to model activations at varying time lags to quantify synchronization, combined with causal LSTM probes on delayed activations to extract turn-taking predictions.
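As a rough illustration of the lagged measurement described above, the following Python sketch computes linear CKA (the linear-kernel form from Kornblith et al.) between two activation matrices over a range of time offsets. The function names, the (time, features) array layout, and the linear kernel are assumptions made here for illustration; the paper's exact CKA variant, layer choice, and lag range are not given on this page.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (time, features)."""
    # Center each feature over time before comparing.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))


def lagged_cka(acts_a, acts_b, max_lag):
    """CKA between instance A at frame t and instance B at frame t + lag."""
    n = min(len(acts_a), len(acts_b))
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = acts_a[: n - lag], acts_b[lag:n]
        else:
            a, b = acts_a[-lag:n], acts_b[: n + lag]
        scores[lag] = linear_cka(a, b)
    return scores
```

Under this convention a peak at lag 0 means the two instances are most similar at the same frame, while a peak at a positive lag would mean B's representation trails A's.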
If this is right
- Synchronization is strongest near zero lag and degrades as channel noise is added.
- Internal states support turn-taking forecasts made before the actual change occurs.
- The anticipatory signals appear from both the current speaker and listener viewpoints.
- Changes in decoding bias alter the interaction patterns in addition to noise effects.
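The forecasting claim in the list above depends on how probe inputs and labels are aligned in time. The sketch below shows one possible alignment, in which a probe sees only a window of past activations and is asked about a turn shift a fixed number of frames later. The function name, the binary turn-shift labels, and the `window` and `horizon` parameters are hypothetical; the paper's actual probe setup is not specified on this page.

```python
import numpy as np


def make_probe_examples(acts, turn_shift, window, horizon):
    """Pair each causal activation window with a future turn-shift label.

    acts:        (time, features) activations from one model instance
    turn_shift:  (time,) binary array, 1 at frames where a turn shift occurs
    window:      number of past frames the probe is allowed to see
    horizon:     how many frames after the last visible frame the label lies
    """
    acts = np.asarray(acts)
    turn_shift = np.asarray(turn_shift)
    inputs, labels = [], []
    for end in range(window, len(acts) - horizon + 1):
        # The probe sees frames end - window .. end - 1 and nothing later.
        inputs.append(acts[end - window:end])
        # The label is taken `horizon` frames after the last visible frame.
        labels.append(turn_shift[end - 1 + horizon])
    return np.stack(inputs), np.array(labels)
```

Because every label lies strictly after the last visible frame whenever `horizon` is at least 1, above-chance accuracy from a probe trained on these pairs would indicate anticipatory information rather than detection of a shift that has already happened.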
Where Pith is reading between the lines
- If the pattern generalizes, synchronization strength could become an additional training signal for improving dialogue naturalness.
- Real-time monitoring of state alignment might allow systems to detect and correct coordination failures during live use.
- The same anticipatory encoding could be tested in multi-agent setups that mix speech with other input types.
Load-bearing premise
Dialogues simulated between two identical copies of the same pretrained model under controlled noise and bias conditions reflect the coordination that would appear in real human-AI or multi-agent full-duplex exchanges.
What would settle it
Repeat the Centered Kernel Alignment analysis and the LSTM probes on dialogues that include actual human speech input or a different full-duplex model. Finding no synchronization peak near zero lag, or no above-chance advance prediction of turn changes, would count against the claims.
Original abstract
Full-duplex spoken dialogue models (SDMs) can listen and speak simultaneously, enabling interaction dynamics closer to human conversation than turn-based systems. Inspired by neural coupling in human communication, we study how such models coordinate their internal representations during interaction. We simulate full-duplex dialogues between two instances of the pretrained Moshi model under controlled conditions, manipulating channel noise and decoding bias. Synchronization is measured using Centered Kernel Alignment (CKA) across temporal lags, while anticipatory turn-taking cues are probed from delayed internal activations using causal LSTM models, from both speaker and listener perspectives. We find strong representational synchronization under no-noise conditions, peaking near zero lag and degrading with noise, and we show that internal states encode anticipatory information that supports turn-taking prediction ahead of time.
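The abstract does not say how channel noise is injected. One common way to control it, shown here purely as an assumption, is to add white Gaussian noise scaled to a target signal-to-noise ratio. The function name and the `snr_db` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def add_channel_noise(signal, snr_db, rng):
    """Add white Gaussian noise to a waveform at a target SNR in decibels.

    This is an assumed noise model, not the paper's documented procedure.
    """
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```

Lower `snr_db` values give a noisier channel, so sweeping this value would be one way to produce a graded series of noise conditions.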
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that full-duplex spoken dialogue models exhibit strong representational synchronization during simulated interactions between two instances of the pretrained Moshi model. Synchronization is quantified via Centered Kernel Alignment (CKA) across temporal lags under controlled channel noise and decoding bias; it peaks near zero lag with no noise and degrades as noise increases. Internal states are further shown to encode anticipatory information that supports turn-taking prediction ahead of time, as probed by causal LSTM models from both speaker and listener perspectives.
Significance. If the central measurements hold after appropriate controls, the work would offer an initial empirical window into internal coordination dynamics in full-duplex SDMs, paralleling neural coupling observations in human dialogue. The controlled noise/bias manipulations and dual-perspective probing constitute a concrete experimental protocol that could be extended to other models or real human-AI settings.
major comments (2)
- [Simulation Protocol] The simulation protocol (described in the methods) uses two identical copies of the same pretrained Moshi model. Because the agents share architecture, weights, and training data, the reported high CKA alignment (peaking near zero lag) could arise simply from both models processing correlated acoustic input or producing similar token distributions, rather than from any interaction-specific synchronization mechanism. No control condition with architecturally distinct models or non-interacting baselines is mentioned, so it is unclear whether the observed degradation with noise reflects loss of coordination or merely reduced input similarity.
- [Abstract and Results] The abstract states clear findings of 'strong representational synchronization' and anticipatory encoding, yet the provided text supplies no quantitative results (exact CKA values, number of dialogues, error bars, or statistical tests). Without these details it is impossible to assess the magnitude, reliability, or robustness of the central claims.
minor comments (2)
- [Methods] Clarify the precise definition and implementation of the causal LSTM probes (input features, lag ranges, training procedure) so that the anticipatory-turn-taking results can be reproduced.
- [Figures] Add explicit baseline CKA curves (e.g., shuffled dialogues or independent non-dialogue audio) to the lag plots so readers can judge how much of the zero-lag peak is attributable to dialogue dynamics versus shared input statistics.
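The shuffled-dialogue baseline requested in the comments above can be sketched as follows: compare CKA for activations taken from the same dialogue against CKA for activations paired across different dialogues. This is a minimal illustration under assumed inputs (lists of per-dialogue activation arrays of shape (time, features)). The helper names are hypothetical, and linear CKA at zero lag is used for simplicity.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (time, features)."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, "fro") ** 2
    return float(cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")))


def _cka_common_length(a, b):
    # Truncate both sequences to their shared length before comparing.
    n = min(len(a), len(b))
    return linear_cka(a[:n], b[:n])


def matched_vs_shuffled_cka(acts_a, acts_b):
    """Mean CKA for same-dialogue pairs versus cross-dialogue (shuffled) pairs.

    acts_a[i] and acts_b[i] are the two instances' activations in dialogue i.
    """
    matched = [_cka_common_length(a, b) for a, b in zip(acts_a, acts_b)]
    shuffled = [
        _cka_common_length(acts_a[i], acts_b[j])
        for i in range(len(acts_a))
        for j in range(len(acts_b))
        if i != j
    ]
    return float(np.mean(matched)), float(np.mean(shuffled))
```

If the shuffled value stays close to the matched value, the alignment is more plausibly due to shared weights and input statistics than to the interaction itself.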
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and outline planned revisions to strengthen the manuscript.
- Referee: [Simulation Protocol] The simulation protocol (described in the methods) uses two identical copies of the same pretrained Moshi model. Because the agents share architecture, weights, and training data, the reported high CKA alignment (peaking near zero lag) could arise simply from both models processing correlated acoustic input or producing similar token distributions, rather than from any interaction-specific synchronization mechanism. No control condition with architecturally distinct models or non-interacting baselines is mentioned, so it is unclear whether the observed degradation with noise reflects loss of coordination or merely reduced input similarity.
Authors: We agree that identical model instances introduce a potential confound from shared weights and architecture. The observed CKA degradation specifically with channel noise (rather than uniform reduction) provides evidence that synchronization depends on the quality of the exchanged signals during interaction. Nevertheless, to isolate interaction-specific coordination from input similarity, we will add a non-interacting baseline in the revised methods and results, in which the two models process independent audio streams without any channel exchange. We will also explicitly discuss the use of identical instances as a limitation and note that future work should test architecturally distinct models.
Revision: yes
- Referee: [Abstract and Results] The abstract states clear findings of 'strong representational synchronization' and anticipatory encoding, yet the provided text supplies no quantitative results (exact CKA values, number of dialogues, error bars, or statistical tests). Without these details it is impossible to assess the magnitude, reliability, or robustness of the central claims.
Authors: The results section reports the quantitative details, including CKA values across lags and noise levels, the number of simulated dialogues, error bars from repeated runs, and associated statistical tests. To make the central claims more immediately assessable, we will revise the abstract to include key quantitative highlights (e.g., peak CKA value under no-noise conditions and number of dialogues) while preserving brevity.
Revision: yes
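The error bars mentioned in the response above could be obtained in several ways, and the page does not say which one the authors use. As one hedged possibility, the sketch below computes a percentile bootstrap confidence interval for the mean of a per-dialogue statistic such as peak CKA. The function name and defaults are assumptions for illustration.

```python
import numpy as np


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    `values` holds one number per simulated dialogue (for example, peak CKA).
    Returns (mean, lower, upper) for a (1 - alpha) interval.
    """
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample dialogues with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    means = values[idx].mean(axis=1)
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(values.mean()), float(lower), float(upper)
```

Resampling whole dialogues rather than individual frames keeps the temporal dependence within each dialogue intact, which is why the dialogue is treated as the unit here.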
Circularity Check
No significant circularity in empirical measurements
The paper reports an empirical study of representational synchronization in simulated full-duplex dialogues using two instances of the pretrained Moshi model, with synchronization quantified via CKA across lags and anticipatory cues probed via causal LSTMs. No mathematical derivation, first-principles prediction, or parameter-fitting step is described that reduces a claimed result to its own inputs by construction. The measurements are direct observations under manipulated noise and bias conditions rather than self-definitional or fitted-input predictions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to justify a derivation. The work is therefore self-contained as experimental reporting without circular reduction.
Reference graph
Works this paper leans on
- [1] Introduction. Human conversation is characterized by rich temporal dynamics in which participants coordinate timing, mirror prosody, and anticipate turn changes [1, 2]. These phenomena include what is referred to as entrainment, the tendency of interlocutors to converge in speech rate, pitch, and syntactic structure, although they may also adapt by becomi...
- [2] Related Work. Spoken Dialogue Systems (SDS) have shifted from rigid, turn-based half-duplex architectures toward synchronous, full-duplex interactions that more closely resemble human communication. Traditional cascaded pipelines for speech recognition, language modeling, and synthesis suffer from latency and unnatural turn-taking [8, 9]. To address t...
- [3] Methods. 3.1. Simulated Full-Duplex Dialogues Environment. To study synchronization between internal representations and turn-taking dynamics under controlled conditions, we generate conversations between two independent instances of Moshi of 100 seconds. As the model can generate speech while consuming audio we connected two instances (A and B) through a...
- [4] Experiment Results. We design two sets of experiments to investigate synchronization and turn-taking dynamics. All experiments are conducted in the simulated two-agent environment described in Section 3. First we evaluate the emergence of synchronization under different communication conditions: (i) varying communication channel noise levels, (ii) ch...
- [5] Discussion and Future Work. Our findings demonstrate that internal representational synchronization in full-duplex models emerges as a computational analog to human neural coupling [5], showing high sensitivity to communication channel integrity. The alignment captured via CKA, combined with the probes’ ability to forecast turn transitions with signi...
- [6] S. Duncan, “On the structure of speaker–auditor interaction during speaking turns,” Language in Society, vol. 3, no. 2, pp. 161–180, 1974.
- [7] T. Stivers, N. J. Enfield, P. Brown, C. Englert, M. Hayashi, T. Heinemann, G. Hoymann, F. Rossano, J. P. De Ruiter, K.-E. Yoon et al., “Universals and cultural variation in turn-taking in conversation,” Proceedings of the National Academy of Sciences, vol. 106, no. 26, pp. 10587–10592, 2009.
- [8] R. H. Gálvez, A. Gravano, Š. Beňuš, R. Levitan, M. Trnka, and J. Hirschberg, “An empirical study of the effect of acoustic-prosodic entrainment on the perceived trustworthiness of conversational avatars,” Speech Communication, vol. 124, pp. 46–67, 2020.
- [9] G. Skantze, “Turn-taking in conversational systems and human-robot interaction: a review,” Computer Speech & Language, vol. 67, p. 101178, 2021.
- [10] G. J. Stephens, L. J. Silbert, and U. Hasson, “Speaker–listener neural coupling underlies successful communication,” Proceedings of the National Academy of Sciences, vol. 107, no. 32, pp. 14425–14430, 2010.
- [11] A. Wilmer, M. de Lussanet, and M. Lappe, “Time-delayed mutual information of the phase as a measure of functional connectivity,” 2012.
- [12] P. Brusco and A. Gravano, “Automatic offline annotation of turn-taking transitions in task-oriented dialogue,” Computer Speech & Language, vol. 78, p. 101462, 2023.
- [13] C.-K. Yang, N. Ho, and H.-y. Lee, “Towards holistic evaluation of large audio-language models: A comprehensive survey,” ArXiv, vol. abs/2505.15957, 2025. [Online]. Available: https://api.semanticscholar.org/CorpusID:278789464
- [14] Y. Peng, Y.-W. Chao, D. Ng, Y. Ma, C. Ni, B. Ma, and C. E. Siong, “Fd-bench: A full-duplex benchmarking pipeline designed for full duplex spoken dialogue systems,” ArXiv, vol. abs/2507.19040, 2025. [Online]. Available: https://api.semanticscholar.org/CorpusID:280046112
- [15] Y. Chen and H. Yu, “From turn-taking to synchronous dialogue: A survey of full-duplex spoken language models,” ArXiv, vol. abs/2509.14515, 2025. [Online]. Available: https://api.semanticscholar.org/CorpusID:281394631
- [16] Y. Ge, S. Chen, J. Xiao, X. Liu, T. Xiao, Y. Xiang, Z. Yu, and J. Zhu, “Flexi: Benchmarking full-duplex human-llm speech interaction,” ArXiv, vol. abs/2509.22243, 2025. [Online]. Available: https://api.semanticscholar.org/CorpusID:281659329
- [17] T. A. Nguyen, E. Kharitonov, J. Copet, Y. Adi, W.-N. Hsu, A. Elkahky, P. Tomasello, R. Algayres, B. Sagot, A. Mohamed et al., “Generative spoken dialogue language modeling,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 250–266, 2023.
- [18] S. Wang, Z. Jin, C. Tang, Q. Li, B. Li, C. Chen, Y. Hu, W. Yu, Y. Li, J. Zhuang, Y. Yang, M. Wang, M. Han, Y. Ding, J. Bai, T. Ouyang, S.-Y. Chang, X. Chen, X. Tian, J. Zhang, L. Lu, G. Sun, Z. Chen, J. Wu, B. Zhou, Y. Wang, T. N. Sainath, Y. Wu, and C. Zhang, “Towards general auditory intelligence: Large multimodal models for machine listening and...
- [19] A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,” 2024. [Online]. Available: https://arxiv.org/abs/2410.00037
- [20] G.-T. Lin, J. Lian, T. Li, Q. Wang, G. Anumanchipalli, A. H. Liu, and H.-y. Lee, “Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities,” arXiv preprint arXiv:2503.04721, 2025.
- [21] S. Arora, Z. Lu, C.-C. Chiu, R. Pang, and S. Watanabe, “Talking turns: Benchmarking audio foundation models on turn-taking dynamics,” CoRR, 2025.
- [22] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Similarity of neural network representations revisited,” in International Conference on Machine Learning. PMLR, 2019, pp. 3519–3529.