TurnNat: Automatic Evaluation of Turn-Taking Naturalness in Dyadic Spoken Dialogue

Georgi Tinchev; Hao Zhang; Laureano Moro-Velazquez; Thomas Thebaud; Venkatesh Ravichandran

arxiv: 2607.01345 · v1 · pith:LACMJKAMnew · submitted 2026-07-01 · 💻 cs.CL · cs.AI

TurnNat: Automatic Evaluation of Turn-Taking Naturalness in Dyadic Spoken Dialogue

Hao Zhang , Thomas Thebaud , Georgi Tinchev , Venkatesh Ravichandran , Laureano Moro-Velazquez This is my paper

Pith reviewed 2026-07-03 21:20 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords turn-taking naturalnessspoken dialogue evaluationvoice activity predictionlikelihood-based metricdyadic dialogueperturbation benchmarkautomatic evaluationfull-duplex dialogue

0 comments

The pith

TurnNat scores turn-taking naturalness by negative log-likelihood of observed voice-activity states under a model trained on natural conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Turn-taking naturalness is central to full-duplex spoken dialogue systems, yet automatic evaluation has been limited because existing methods rely on human judgments or narrow timing metrics that cannot compare different kinds of failures. TurnNat trains a causal model on natural conversations to predict future two-speaker voice-activity states, then treats the negative log-likelihood of the actual observed states as a measure of timing atypicality. These frame-level scores are pooled over turn-taking boundary units taken from utterance onsets and offsets, and the mean and tail values are aggregated into a dialogue-level naturalness score. The authors created a controlled benchmark of paired natural and perturbed clips that human raters judged for naturalness, and experiments show the scores successfully flag unnatural perturbations of many kinds. A sympathetic reader would care because the method supplies a single, scalable metric that works across heterogeneous timing problems without repeated human evaluation.

Core claim

TurnNat is a likelihood-based framework that trains a causal turn-taking prediction model on natural conversations, computes the negative log-likelihood of observed future two-speaker voice-activity states to quantify timing atypicality, pools the values over turn-taking boundary units extracted from utterance boundaries, and aggregates mean and tail scores into a dialogue-level naturalness score; experiments on a controlled perturbation benchmark validated by human judgments show that the resulting scores identify unnatural turn-taking across heterogeneous timing failures.

What carries the argument

A causal turn-taking prediction model trained on natural conversations that estimates future two-speaker voice-activity states, with negative log-likelihood of the observed states serving as the measure of timing atypicality.

If this is right

Dialogue systems can be compared and improved on turn-taking quality using a single automatic score instead of separate human studies for each timing issue.
Overlaps, gaps, and other timing perturbations can be evaluated and ranked on the same numerical scale.
Training or fine-tuning of full-duplex dialogue models can incorporate the TurnNat score as a direct optimization target.
The approach can be applied to any new two-channel dialogue corpus once the prediction model is retrained on its natural examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the underlying prediction model generalizes well, the same scoring procedure could be transferred to new languages or domains by retraining only the voice-activity predictor.
TurnNat scores could be combined with other automatic dialogue metrics such as semantic appropriateness to produce a broader evaluation suite.
Real-time computation of the likelihood during live conversations might allow a system to adjust its own turn-taking behavior on the fly.

Load-bearing premise

That negative log-likelihood of future voice-activity states under a model trained on natural conversations measures timing atypicality in a way that aligns with human naturalness judgments.

What would settle it

A collection of dialogue clips in which human raters judge the timing as natural yet TurnNat assigns high atypicality scores, or clips in which perturbed timing receives lower scores than the corresponding natural versions.

Figures

Figures reproduced from arXiv: 2607.01345 by Georgi Tinchev, Hao Zhang, Laureano Moro-Velazquez, Thomas Thebaud, Venkatesh Ravichandran.

**Figure 1.** Figure 1: Overview of the TurnNat framework. TurnNat first extracts VAD-based turn-taking boundary units from the two-channel dialogue, then uses a causal [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

read the original abstract

Turn-taking naturalness is central to full-duplex spoken dialogue systems, yet its automatic evaluation remains limited. Existing evaluations often rely on human judgments or behavior-specific timing metrics, making it difficult to compare heterogeneous timing failures within a unified framework. We propose TurnNat, a likelihood-based framework for automatic turn-taking naturalness evaluation in two-channel spoken dialogue. A causal turn-taking prediction model trained on natural conversations estimates future two-speaker voice-activity states, and the negative log-likelihood (NLL) of the observed future activity measures timing atypicality. TurnNat pools frame-level NLLs over turn-taking boundary units (TBUs) extracted from utterance onsets and offsets, and aggregates mean and tail TBU scores into a dialogue-level naturalness score. We further construct a controlled perturbation benchmark of paired natural and perturbed dialogue clips, validated by human naturalness judgments. Experiments on this benchmark show that TurnNat successfully identifies unnatural turn-taking perturbations across heterogeneous timing failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TurnNat introduces a likelihood-based metric for turn-taking naturalness using NLL from a causal VAD predictor plus a human-validated perturbation benchmark.

read the letter

TurnNat gives a unified automatic way to score turn-taking naturalness using negative log-likelihood from a causal voice-activity predictor, and they back it with a human-validated perturbation benchmark.

The new part is the shift away from separate hand-crafted timing features toward a single likelihood score that can catch different kinds of timing problems. Training the model on natural two-speaker sequences and then measuring how unlikely the actual future states are makes sense for detecting atypical turn-taking. Aggregating at onsets and offsets with mean and tail scores is a practical aggregation step. The benchmark of paired clips with controlled perturbations is the part that lets them test whether the scores match human judgments.

The main soft spot is how tightly the NLL tracks actual human naturalness ratings. If the correlation is only moderate or if certain perturbation types are missed, the metric would need more work before it replaces listening tests. The paper would be stronger with more detail on the prediction model architecture and any ablations on the TBU pooling.

This is for teams building full-duplex spoken dialogue systems who want something more scalable than repeated human evaluations. A reader focused on dialogue system evaluation would get concrete ideas from the method and the benchmark construction. It deserves a serious referee because the core idea is coherent and the evaluation setup is testable.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes TurnNat, a likelihood-based framework for automatic evaluation of turn-taking naturalness in two-channel spoken dialogue. A causal model is trained on natural conversations to predict future two-speaker voice-activity states; negative log-likelihood (NLL) of observed future states, pooled over turn-taking boundary units (TBUs) at onsets/offsets and aggregated into mean and tail scores, serves as the naturalness metric. A controlled perturbation benchmark of paired natural/perturbed clips is constructed and human-validated. The central claim is that experiments on this benchmark demonstrate TurnNat successfully identifies unnatural perturbations across heterogeneous timing failures.

Significance. If the empirical results hold, the work would supply a unified, model-based automatic metric capable of comparing heterogeneous timing failures without task-specific rules or direct human rating at inference time. The human-validated perturbation benchmark itself constitutes a reusable contribution for controlled evaluation of turn-taking models.

major comments (2)

[Abstract] Abstract: the claim that 'Experiments on this benchmark show that TurnNat successfully identifies unnatural turn-taking perturbations across heterogeneous timing failures' is unsupported by any quantitative results, tables, figures, error analysis, or baseline comparisons in the manuscript description. This absence is load-bearing for the central empirical claim.
[Abstract / Experiments] The evaluation rests on the assumption that NLL of future VAD states under a model trained on natural data aligns with human naturalness judgments for perturbed dialogues; while a human-validated benchmark is mentioned, no details are supplied on perturbation generation, inter-rater agreement, or correlation between TBU NLL scores and human ratings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the major comments point by point below. We agree that the abstract and evaluation sections require strengthening with additional quantitative support and methodological details, and we will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'Experiments on this benchmark show that TurnNat successfully identifies unnatural turn-taking perturbations across heterogeneous timing failures' is unsupported by any quantitative results, tables, figures, error analysis, or baseline comparisons in the manuscript description. This absence is load-bearing for the central empirical claim.

Authors: We agree that the abstract claim would be stronger if accompanied by quantitative evidence. The manuscript body contains the relevant experimental results, tables, and figures, but these are not referenced or summarized in the abstract. We will revise the abstract to include key quantitative outcomes (e.g., differences in mean and tail TBU NLL scores between natural and perturbed conditions, along with statistical significance) to directly support the claim. revision: yes
Referee: [Abstract / Experiments] The evaluation rests on the assumption that NLL of future VAD states under a model trained on natural data aligns with human naturalness judgments for perturbed dialogues; while a human-validated benchmark is mentioned, no details are supplied on perturbation generation, inter-rater agreement, or correlation between TBU NLL scores and human ratings.

Authors: We acknowledge the need for greater transparency on the benchmark construction and validation. In the revised manuscript we will expand the relevant section to include: (1) the precise procedure used to generate the controlled perturbations, (2) inter-rater agreement statistics from the human validation study, and (3) quantitative correlation results between the aggregated TBU NLL scores and the collected human naturalness ratings. These additions will make the alignment between model likelihood and human judgments explicit. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper defines TurnNat as a likelihood-based scoring method: a causal model is trained on natural two-speaker VAD sequences, after which NLL of observed future states (aggregated over TBUs) serves as the atypicality measure. This score is then tested for its ability to flag artificially perturbed clips on a human-validated benchmark. No derivation chain, equation, or self-citation reduces the central claim to a tautology or fitted input by construction; the evaluation relies on external human judgments rather than internal re-use of the same fitted values. The approach is therefore self-contained against the supplied benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review is based solely on the abstract; full text, methods, and data unavailable.

axioms (1)

domain assumption A causal turn-taking prediction model trained on natural conversations can estimate future two-speaker voice-activity states whose negative log-likelihood measures timing atypicality.
This is the core premise enabling the NLL-based score.

pith-pipeline@v0.9.1-grok · 5710 in / 1193 out tokens · 24605 ms · 2026-07-03T21:20:36.206575+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 9 canonical work pages · 1 internal anchor

[1]

A survey of recent advances on turn-taking modeling in spoken dialogue systems,

G. Castillo-L ´opez, G. de Chalendar, and N. Semmar, “A survey of recent advances on turn-taking modeling in spoken dialogue systems,” inProceedings of the 15th international workshop on spoken dialogue systems technology, 2025, pp. 254–271

2025
[2]

Response timing estimation for spoken dialog systems based on syntactic completeness prediction,

J. Sakuma, S. Fujie, and T. Kobayashi, “Response timing estimation for spoken dialog systems based on syntactic completeness prediction,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 369–374

2022
[3]

Wavchat: A survey of spoken dialogue models,

S. Ji, Y . Chen, M. Fang, J. Zuo, J. Lu, H. Wang, Z. Jiang, L. Zhou, S. Liu, X. Chenget al., “Wavchat: A survey of spoken dialogue models,”arXiv preprint arXiv:2411.13577, 2024

work page arXiv 2024
[4]

Moshi: a speech-text foundation model for real-time dialogue

A. D ´efossez, L. Mazar ´e, M. Orsini, A. Royer, P. P ´erez, H. J ´egou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,”arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Llama- omni: Seamless speech interaction with large language models,

Q. Fang, S. Guo, Y . Zhou, Z. Ma, S. Zhang, and Y . Feng, “Llama- omni: Seamless speech interaction with large language models,” in International Conference on Learning Representations, vol. 2025, 2025, pp. 57 607–57 624

2025
[6]

Mini-omni: Language models can hear, talk while thinking in streaming,

Z. Xie and C. Wu, “Mini-omni: Language models can hear, talk while thinking in streaming,”arXiv preprint arXiv:2408.16725, 2024

work page arXiv 2024
[7]

Freeze-omni: A smart and low latency speech-to-speech dia- logue model with frozen llm,

X. Wang, Y . Li, C. Fu, Y . Zhang, Y . Shen, L. Xie, K. Li, X. Sun, and L. Ma, “Freeze-omni: A smart and low latency speech-to-speech dia- logue model with frozen llm,” inInternational Conference on Machine Learning. PMLR, 2025, pp. 63 345–63 354

2025
[8]

Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities,

G.-T. Lin, J. Lian, T. Li, Q. Wang, G. Anumanchipalli, A. H. Liu, and H.-y. Lee, “Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities,”arXiv preprint arXiv:2503.04721, 2025

work page arXiv 2025
[9]

Mitigating response delays in free-form conversations with llm- powered intelligent virtual agents,

M. Maslych, M. Katebi, C. Lee, Y . Hmaiti, A. Ghasemaghaei, C. Pumarada, J. Palmer, E. Segarra Martinez, M. Emporio, W. Snipes et al., “Mitigating response delays in free-form conversations with llm- powered intelligent virtual agents,” inProceedings of the 7th ACM Conference on Conversational User Interfaces, 2025, pp. 1–15

2025
[10]

Neural generation of dialogue response timings,

M. Roddy and N. Harte, “Neural generation of dialogue response timings,” inProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2442–2452

2020
[11]

Toward enabling natural conversation with older adults via the design of llm-powered voice agents that support interruptions and backchannels,

C. Liu, M. Su, Y . Xiang, Y . Huang, Y . Yang, K. Zhang, and M. Fan, “Toward enabling natural conversation with older adults via the design of llm-powered voice agents that support interruptions and backchannels,” inProceedings of the 2025 CHI conference on human factors in computing systems, 2025, pp. 1–22

2025
[12]

A mixed-methods approach to understanding user trust after voice assistant failures,

A. Baughan, X. Wang, A. Liu, A. Mercurio, J. Chen, and X. Ma, “A mixed-methods approach to understanding user trust after voice assistant failures,” inProceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 2023, pp. 1–16

2023
[13]

Patient engagement with conversational agents in health applications 2016–2022: a systematic review and meta-analysis,

K. E. Cevasco, R. E. Morrison Brown, R. Woldeselassie, and S. Kaplan, “Patient engagement with conversational agents in health applications 2016–2022: a systematic review and meta-analysis,”Journal of medical systems, vol. 48, no. 1, p. 40, 2024

2016
[14]

Empathic conversational agent platform designs and their evaluation in the context of mental health: systematic review,

R. Sanjeewa, R. Iyer, P. Apputhurai, N. Wickramasinghe, and D. Meyer, “Empathic conversational agent platform designs and their evaluation in the context of mental health: systematic review,”JMIR Mental Health, vol. 11, p. e58974, 2024

2024
[15]

Talking turns: Benchmarking audio foundation models on turn-taking dynamics,

S. Arora, Z. Lu, C.-C. Chiu, R. Pang, and S. Watanabe, “Talking turns: Benchmarking audio foundation models on turn-taking dynamics,”arXiv preprint arXiv:2503.01174, 2025

work page arXiv 2025
[16]

Turn-taking in conversational systems and human-robot interaction: a review,

G. Skantze, “Turn-taking in conversational systems and human-robot interaction: a review,”Computer Speech & Language, vol. 67, p. 101178, 2021

2021
[17]

Turngpt: a transformer-based language model for predicting turn-taking in spoken dialog,

E. Ekstedt and G. Skantze, “Turngpt: a transformer-based language model for predicting turn-taking in spoken dialog,” inFindings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 2981–2990

2020
[18]

V oice activity projection: Self-supervised learning of turn-taking events,

——, “V oice activity projection: Self-supervised learning of turn-taking events,” inProc. Interspeech 2022, 2022, pp. 5190–5194

2022
[19]

Predicting turn-taking and backchannel in human-machine conversations using linguistic, acoustic, and visual signals,

Y . Lin, Y . Zheng, M. Zeng, and W. Shi, “Predicting turn-taking and backchannel in human-machine conversations using linguistic, acoustic, and visual signals,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), 2025, pp. 15 310–15 322

2025
[20]

Turn-taking and backchannel prediction with acoustic and large language model fusion,

J. Wang, L. Chen, A. Khare, A. Raju, P. Dheram, D. He, M. Wu, A. Stol- cke, and V . Ravichandran, “Turn-taking and backchannel prediction with acoustic and large language model fusion,” inICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12 121–12 125

2024
[21]

Dualturn: Learning turn-taking from dual-channel generative speech pretraining,

S. Rajaa, “Dualturn: Learning turn-taking from dual-channel generative speech pretraining,”arXiv preprint arXiv:2603.08216, 2026

work page arXiv 2026
[22]

Ntpp: Generative speech language modeling for dual- channel spoken dialogue via next-token-pair prediction,

Q. Wang, Z. Meng, W. Cui, Y . Zhang, P. Wu, B. Wu, I. King, L. Chen, and P. Zhao, “Ntpp: Generative speech language modeling for dual- channel spoken dialogue via next-token-pair prediction,”arXiv preprint arXiv:2506.00975, 2025

work page arXiv 2025
[23]

From reaction to prediction: Experiments with compu- tational models of turn-taking,

D. Schlangen, “From reaction to prediction: Experiments with compu- tational models of turn-taking,”Proceedings of Interspeech 2006, Panel on Prosody of Dialogue Acts and Turn-Taking, 2006

2006
[24]

Data-driven models for timing feedback responses in a map task dialogue system,

R. Meena, G. Skantze, and J. Gustafson, “Data-driven models for timing feedback responses in a map task dialogue system,”Computer Speech & Language, vol. 28, no. 4, pp. 903–922, 2014

2014
[25]

Opportunities and obligations to take turns in collaborative multi-party human-robot interaction,

M. Johansson and G. Skantze, “Opportunities and obligations to take turns in collaborative multi-party human-robot interaction,” inProceed- ings of the 16th annual meeting of the special interest group on discourse and dialogue, 2015, pp. 305–314

2015
[26]

Towards a general, continuous model of turn-taking in spoken dialogue using lstm recurrent neural networks,

G. Skantze, “Towards a general, continuous model of turn-taking in spoken dialogue using lstm recurrent neural networks,” inProceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, 2017, pp. 220–230

2017
[27]

Multi- lingual turn-taking prediction using voice activity projection,

K. Inoue, B. Jiang, E. Ekstedt, T. Kawahara, and G. Skantze, “Multi- lingual turn-taking prediction using voice activity projection,” inPro- ceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation (lrec-coling 2024), 2024, pp. 11 873–11 883

2024
[28]

Prompt-guided turn-taking prediction,

K. Inoue, M. Elmers, Y . Fu, Z. H. Pang, D. Lala, K. Ochi, and T. Kawahara, “Prompt-guided turn-taking prediction,” inProceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2025, pp. 146–151

2025
[29]

Real-time and continuous turn-taking prediction using voice activity projection,

K. Inoue, B. Jiang, E. Ekstedt, T. Kawahara, and G. Skantze, “Real-time and continuous turn-taking prediction using voice activity projection,” arXiv preprint arXiv:2401.04868, 2024

work page arXiv 2024
[30]

Visual cues enhance predictive turn-taking for two-party human interaction,

S. O. Russell and N. Harte, “Visual cues enhance predictive turn-taking for two-party human interaction,” inFindings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 209–221. [Online]. Available: https://aclantholo...

2025
[31]

Full-duplex-bench v1. 5: Evaluating overlap handling for full-duplex speech models,

G.-T. Lin, S.-Y . S. Kuan, Q. Wang, J. Lian, T. Li, S. Watanabe, and H.-y. Lee, “Full-duplex-bench v1. 5: Evaluating overlap handling for full-duplex speech models,” inICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2026, pp. 19 447–19 451

2026
[32]

Seamless inter- action: Dyadic audiovisual motion modeling and large-scale dataset,

V . Agrawal, A. Akinyemi, K. Alvero, M. Behrooz, J. Buffalini, F. M. Carlucci, J. Chen, J. Chen, Z. Chen, S. Chenget al., “Seamless inter- action: Dyadic audiovisual motion modeling and large-scale dataset,” arXiv preprint arXiv:2506.22554, 2025

work page arXiv 2025
[33]

Krippendorff,Content analysis: An introduction to its methodology

K. Krippendorff,Content analysis: An introduction to its methodology. Sage publications, 2018

2018

[1] [1]

A survey of recent advances on turn-taking modeling in spoken dialogue systems,

G. Castillo-L ´opez, G. de Chalendar, and N. Semmar, “A survey of recent advances on turn-taking modeling in spoken dialogue systems,” inProceedings of the 15th international workshop on spoken dialogue systems technology, 2025, pp. 254–271

2025

[2] [2]

Response timing estimation for spoken dialog systems based on syntactic completeness prediction,

J. Sakuma, S. Fujie, and T. Kobayashi, “Response timing estimation for spoken dialog systems based on syntactic completeness prediction,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 369–374

2022

[3] [3]

Wavchat: A survey of spoken dialogue models,

S. Ji, Y . Chen, M. Fang, J. Zuo, J. Lu, H. Wang, Z. Jiang, L. Zhou, S. Liu, X. Chenget al., “Wavchat: A survey of spoken dialogue models,”arXiv preprint arXiv:2411.13577, 2024

work page arXiv 2024

[4] [4]

Moshi: a speech-text foundation model for real-time dialogue

A. D ´efossez, L. Mazar ´e, M. Orsini, A. Royer, P. P ´erez, H. J ´egou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,”arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Llama- omni: Seamless speech interaction with large language models,

Q. Fang, S. Guo, Y . Zhou, Z. Ma, S. Zhang, and Y . Feng, “Llama- omni: Seamless speech interaction with large language models,” in International Conference on Learning Representations, vol. 2025, 2025, pp. 57 607–57 624

2025

[6] [6]

Mini-omni: Language models can hear, talk while thinking in streaming,

Z. Xie and C. Wu, “Mini-omni: Language models can hear, talk while thinking in streaming,”arXiv preprint arXiv:2408.16725, 2024

work page arXiv 2024

[7] [7]

Freeze-omni: A smart and low latency speech-to-speech dia- logue model with frozen llm,

X. Wang, Y . Li, C. Fu, Y . Zhang, Y . Shen, L. Xie, K. Li, X. Sun, and L. Ma, “Freeze-omni: A smart and low latency speech-to-speech dia- logue model with frozen llm,” inInternational Conference on Machine Learning. PMLR, 2025, pp. 63 345–63 354

2025

[8] [8]

Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities,

G.-T. Lin, J. Lian, T. Li, Q. Wang, G. Anumanchipalli, A. H. Liu, and H.-y. Lee, “Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities,”arXiv preprint arXiv:2503.04721, 2025

work page arXiv 2025

[9] [9]

Mitigating response delays in free-form conversations with llm- powered intelligent virtual agents,

M. Maslych, M. Katebi, C. Lee, Y . Hmaiti, A. Ghasemaghaei, C. Pumarada, J. Palmer, E. Segarra Martinez, M. Emporio, W. Snipes et al., “Mitigating response delays in free-form conversations with llm- powered intelligent virtual agents,” inProceedings of the 7th ACM Conference on Conversational User Interfaces, 2025, pp. 1–15

2025

[10] [10]

Neural generation of dialogue response timings,

M. Roddy and N. Harte, “Neural generation of dialogue response timings,” inProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2442–2452

2020

[11] [11]

Toward enabling natural conversation with older adults via the design of llm-powered voice agents that support interruptions and backchannels,

C. Liu, M. Su, Y . Xiang, Y . Huang, Y . Yang, K. Zhang, and M. Fan, “Toward enabling natural conversation with older adults via the design of llm-powered voice agents that support interruptions and backchannels,” inProceedings of the 2025 CHI conference on human factors in computing systems, 2025, pp. 1–22

2025

[12] [12]

A mixed-methods approach to understanding user trust after voice assistant failures,

A. Baughan, X. Wang, A. Liu, A. Mercurio, J. Chen, and X. Ma, “A mixed-methods approach to understanding user trust after voice assistant failures,” inProceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 2023, pp. 1–16

2023

[13] [13]

Patient engagement with conversational agents in health applications 2016–2022: a systematic review and meta-analysis,

K. E. Cevasco, R. E. Morrison Brown, R. Woldeselassie, and S. Kaplan, “Patient engagement with conversational agents in health applications 2016–2022: a systematic review and meta-analysis,”Journal of medical systems, vol. 48, no. 1, p. 40, 2024

2016

[14] [14]

Empathic conversational agent platform designs and their evaluation in the context of mental health: systematic review,

R. Sanjeewa, R. Iyer, P. Apputhurai, N. Wickramasinghe, and D. Meyer, “Empathic conversational agent platform designs and their evaluation in the context of mental health: systematic review,”JMIR Mental Health, vol. 11, p. e58974, 2024

2024

[15] [15]

Talking turns: Benchmarking audio foundation models on turn-taking dynamics,

S. Arora, Z. Lu, C.-C. Chiu, R. Pang, and S. Watanabe, “Talking turns: Benchmarking audio foundation models on turn-taking dynamics,”arXiv preprint arXiv:2503.01174, 2025

work page arXiv 2025

[16] [16]

Turn-taking in conversational systems and human-robot interaction: a review,

G. Skantze, “Turn-taking in conversational systems and human-robot interaction: a review,”Computer Speech & Language, vol. 67, p. 101178, 2021

2021

[17] [17]

Turngpt: a transformer-based language model for predicting turn-taking in spoken dialog,

E. Ekstedt and G. Skantze, “Turngpt: a transformer-based language model for predicting turn-taking in spoken dialog,” inFindings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 2981–2990

2020

[18] [18]

V oice activity projection: Self-supervised learning of turn-taking events,

——, “V oice activity projection: Self-supervised learning of turn-taking events,” inProc. Interspeech 2022, 2022, pp. 5190–5194

2022

[19] [19]

Predicting turn-taking and backchannel in human-machine conversations using linguistic, acoustic, and visual signals,

Y . Lin, Y . Zheng, M. Zeng, and W. Shi, “Predicting turn-taking and backchannel in human-machine conversations using linguistic, acoustic, and visual signals,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), 2025, pp. 15 310–15 322

2025

[20] [20]

Turn-taking and backchannel prediction with acoustic and large language model fusion,

J. Wang, L. Chen, A. Khare, A. Raju, P. Dheram, D. He, M. Wu, A. Stol- cke, and V . Ravichandran, “Turn-taking and backchannel prediction with acoustic and large language model fusion,” inICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12 121–12 125

2024

[21] [21]

Dualturn: Learning turn-taking from dual-channel generative speech pretraining,

S. Rajaa, “Dualturn: Learning turn-taking from dual-channel generative speech pretraining,”arXiv preprint arXiv:2603.08216, 2026

work page arXiv 2026

[22] [22]

Ntpp: Generative speech language modeling for dual- channel spoken dialogue via next-token-pair prediction,

Q. Wang, Z. Meng, W. Cui, Y . Zhang, P. Wu, B. Wu, I. King, L. Chen, and P. Zhao, “Ntpp: Generative speech language modeling for dual- channel spoken dialogue via next-token-pair prediction,”arXiv preprint arXiv:2506.00975, 2025

work page arXiv 2025

[23] [23]

From reaction to prediction: Experiments with compu- tational models of turn-taking,

D. Schlangen, “From reaction to prediction: Experiments with compu- tational models of turn-taking,”Proceedings of Interspeech 2006, Panel on Prosody of Dialogue Acts and Turn-Taking, 2006

2006

[24] [24]

Data-driven models for timing feedback responses in a map task dialogue system,

R. Meena, G. Skantze, and J. Gustafson, “Data-driven models for timing feedback responses in a map task dialogue system,”Computer Speech & Language, vol. 28, no. 4, pp. 903–922, 2014

2014

[25] [25]

Opportunities and obligations to take turns in collaborative multi-party human-robot interaction,

M. Johansson and G. Skantze, “Opportunities and obligations to take turns in collaborative multi-party human-robot interaction,” inProceed- ings of the 16th annual meeting of the special interest group on discourse and dialogue, 2015, pp. 305–314

2015

[26] [26]

Towards a general, continuous model of turn-taking in spoken dialogue using lstm recurrent neural networks,

G. Skantze, “Towards a general, continuous model of turn-taking in spoken dialogue using lstm recurrent neural networks,” inProceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, 2017, pp. 220–230

2017

[27] [27]

Multi- lingual turn-taking prediction using voice activity projection,

K. Inoue, B. Jiang, E. Ekstedt, T. Kawahara, and G. Skantze, “Multi- lingual turn-taking prediction using voice activity projection,” inPro- ceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation (lrec-coling 2024), 2024, pp. 11 873–11 883

2024

[28] [28]

Prompt-guided turn-taking prediction,

K. Inoue, M. Elmers, Y . Fu, Z. H. Pang, D. Lala, K. Ochi, and T. Kawahara, “Prompt-guided turn-taking prediction,” inProceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2025, pp. 146–151

2025

[29] [29]

Real-time and continuous turn-taking prediction using voice activity projection,

K. Inoue, B. Jiang, E. Ekstedt, T. Kawahara, and G. Skantze, “Real-time and continuous turn-taking prediction using voice activity projection,” arXiv preprint arXiv:2401.04868, 2024

work page arXiv 2024

[30] [30]

Visual cues enhance predictive turn-taking for two-party human interaction,

S. O. Russell and N. Harte, “Visual cues enhance predictive turn-taking for two-party human interaction,” inFindings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 209–221. [Online]. Available: https://aclantholo...

2025

[31] [31]

Full-duplex-bench v1. 5: Evaluating overlap handling for full-duplex speech models,

G.-T. Lin, S.-Y . S. Kuan, Q. Wang, J. Lian, T. Li, S. Watanabe, and H.-y. Lee, “Full-duplex-bench v1. 5: Evaluating overlap handling for full-duplex speech models,” inICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2026, pp. 19 447–19 451

2026

[32] [32]

Seamless inter- action: Dyadic audiovisual motion modeling and large-scale dataset,

V . Agrawal, A. Akinyemi, K. Alvero, M. Behrooz, J. Buffalini, F. M. Carlucci, J. Chen, J. Chen, Z. Chen, S. Chenget al., “Seamless inter- action: Dyadic audiovisual motion modeling and large-scale dataset,” arXiv preprint arXiv:2506.22554, 2025

work page arXiv 2025

[33] [33]

Krippendorff,Content analysis: An introduction to its methodology

K. Krippendorff,Content analysis: An introduction to its methodology. Sage publications, 2018

2018