Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

Chun-Yi Kuan; En-Pei Hu; Guan-Ting Lin; Hung-yi Lee; James Glass; Kai-Wei Chang; Shao-Hua Sun; Wei-Chih Chen; Wenze Ren; Yu Tsao

arXiv: 2509.26388 · v4 · submitted 2025-09-30 · 📡 eess.AS · cs.AI · cs.CL

Pith reviewed 2026-05-18 11:45 UTC · model grok-4.3

keywords: spoken language models · temporal dynamics · benchmark · conversational AI · time awareness · full-duplex interaction · speech timing · real-time speech

The pith

State-of-the-art spoken language models handle basic tasks, but nearly all models degrade sharply when required to manage timing, tempo, and synchronized speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Game-Time Benchmark as a way to test temporal dynamics in conversational spoken language models, including the ability to follow timing rules and respond in sync with another speaker. Evaluation across architectures shows that state-of-the-art models do well on basic instructions while many other systems still struggle with them, and that nearly all models drop sharply once constraints on tempo or simultaneous speaking are introduced. This matters for anyone building real-time voice systems, because natural conversation depends on handling timing, tempo, and overlapping speech without breaking flow. The work positions the benchmark as a practical tool for exposing these gaps and guiding work to close them.

Core claim

The Game-Time Benchmark, inspired by human language-learning activities, combines basic instruction-following tasks with advanced tasks that add temporal constraints such as tempo adherence and synchronized responses. On this benchmark, state-of-the-art spoken language models handle the basic tasks well, yet nearly all models degrade substantially under temporal constraints, revealing persistent weaknesses in time awareness and full-duplex interaction.

What carries the argument

Game-Time Benchmark, a two-tier framework of basic and temporally constrained tasks that directly measures timing, tempo, and simultaneous-speaking abilities.
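
One way to make the two-tier comparison concrete is to report how much of a model's basic-tier score is lost on the temporally constrained tier. The sketch below is an illustration only: the function name `relative_drop`, the assumption that every task yields a score in [0, 1], and the placeholder numbers are all assumptions made here, not the paper's metric or its results.

```python
from statistics import mean


def relative_drop(basic_scores, temporal_scores):
    """Fraction of the basic-tier mean score lost on the temporal tier.

    Hypothetical helper: assumes each task produces a score in [0, 1].
    """
    basic = mean(basic_scores)
    temporal = mean(temporal_scores)
    if basic == 0:
        raise ValueError("basic-tier mean is zero; relative drop is undefined")
    return (basic - temporal) / basic


# Placeholder numbers for illustration only; these are NOT results from the paper.
example_basic = [0.8, 0.9, 1.0]
example_temporal = [0.3, 0.45, 0.6]
drop = relative_drop(example_basic, example_temporal)
print(f"relative drop: {drop:.2f}")
```

A relative drop is used here rather than an absolute difference so that models with very different basic-tier scores can be compared on the same scale; an absolute difference would be an equally reasonable choice.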

If this is right

Development of spoken language models must explicitly target time awareness to reach usable conversational fluency.
Full-duplex capabilities remain unreliable across most current architectures under realistic timing loads.
The benchmark supplies concrete metrics that can direct iterative improvements in temporal handling.
Without addressing these drops, real-time speech systems will continue to fall short of human-like interaction standards.
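
To illustrate what a concrete timing metric could look like, take the example instruction quoted from the paper in the reference graph on this page, "Please count from one to ten in 10 seconds". The sketch below checks a response against such an instruction. It is a hedged illustration, not the benchmark's scoring code: the function `timing_report`, the input format (per-word start and end times in seconds, for example from a forced aligner), and the 20% tolerance are all assumptions made for this example.

```python
def timing_report(words, target_seconds, rel_tolerance=0.2):
    """Summarize how well a spoken response fits a duration instruction.

    words: list of (start, end) times in seconds, one pair per spoken word,
           in spoken order (hypothetical input format).
    target_seconds: duration requested by the instruction.
    rel_tolerance: allowed deviation as a fraction of the target (assumed value).
    """
    if len(words) < 2:
        raise ValueError("need at least two timed words")
    starts = [start for start, _ in words]
    # Total speaking time: first word onset to last word offset.
    total = words[-1][1] - words[0][0]
    # Gaps between consecutive word onsets describe the tempo.
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return {
        "total_seconds": total,
        "duration_ok": abs(total - target_seconds) <= rel_tolerance * target_seconds,
        "max_gap_deviation": max(abs(gap - mean_gap) for gap in gaps),
    }


# Synthetic, perfectly even count: one word per second, each lasting 0.5 s.
even_count = [(float(i), i + 0.5) for i in range(10)]
report = timing_report(even_count, target_seconds=10.0)
print(report)
```

Reporting the total duration and the largest onset-gap deviation separately keeps "finished on time" distinct from "kept a steady tempo", since a response can satisfy one and fail the other.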

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the benchmark tasks prove easier than messy real conversations, the performance gaps could widen further in deployment.
Adding explicit temporal supervision during training might reduce the observed degradation without changing model scale.
Extending the tasks to multi-party settings with interruptions would test whether the weaknesses generalize beyond pairwise exchanges.

Load-bearing premise

The Game-Time tasks and metrics accurately reflect real-world conversational temporal dynamics and fluency requirements.

What would settle it

If models maintain high accuracy and low latency when tested on unscripted real-time dialogues that require precise timing, tempo changes, and overlapping speech, the reported weaknesses would be undermined.

Original abstract

Conversational Spoken Language Models (SLMs) are emerging as a promising paradigm for real-time speech interaction. However, their capacity of temporal dynamics, including the ability to manage timing, tempo and simultaneous speaking, remains a critical and unevaluated challenge for conversational fluency. To address this gap, we introduce the Game-Time Benchmark, a framework to systematically assess these temporal capabilities. Inspired by how humans learn a language through language activities, Game-Time consists of basic instruction-following tasks and advanced tasks with temporal constraints, such as tempo adherence and synchronized responses. Our evaluation of diverse SLM architectures reveals a clear performance disparity: while state-of-the-art models handle basic tasks well, many contemporary systems still struggle with fundamental instruction-following. More critically, nearly all models degrade substantially under temporal constraints, exposing persistent weaknesses in time awareness and full-duplex interaction. The Game-Time Benchmark provides a foundation for guiding future research toward more temporally-aware conversational AI. Demos and datasets are available on our project website https://ga642381.github.io/Game-Time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Game-Time sketches a benchmark for temporal skills in spoken LMs but the abstract alone leaves the tasks and claims too underspecified to judge if they measure real conversational dynamics.

The main thing to know is that this paper introduces the Game-Time benchmark to test how spoken language models handle timing, tempo, and simultaneous speaking in conversation. The abstract claims that models do okay on basic tasks but degrade sharply once temporal constraints are added, which would matter for real-time systems if it holds up. That is the core pitch, and it is worth noting because temporal awareness is an underexplored angle in current SLM evaluations.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Game-Time Benchmark to systematically assess temporal capabilities in conversational Spoken Language Models (SLMs), including basic instruction-following tasks and advanced tasks with temporal constraints such as tempo adherence and synchronized responses. Evaluation of diverse SLM architectures shows that state-of-the-art models handle basic tasks well, but nearly all models degrade substantially under temporal constraints, exposing persistent weaknesses in time awareness and full-duplex interaction. The benchmark is positioned as a foundation for future research, with demos and datasets available online.

Significance. If the results hold after proper validation, this work addresses an important gap in evaluating real-time speech interaction by focusing on timing, tempo, and simultaneous speaking, which are essential for conversational fluency. It could help guide development of more temporally-aware SLMs and provides public resources that support reproducibility.

major comments (2)

[Abstract] The central claim that 'nearly all models degrade substantially under temporal constraints' is stated without any details on model selection, exact task definitions, metrics, statistical tests, or controls. This absence makes the performance disparity and degradation claims unverifiable and is load-bearing for the paper's main contribution.
[Abstract] The assumption that Game-Time tasks and metrics accurately reflect real-world conversational temporal dynamics is not supported by any validation, such as human baselines or comparison to real conversational data. Without this, it remains possible that observed degradations reflect benchmark artifacts rather than genuine limitations in time awareness.

minor comments (1)

[Abstract] Consider adding a short clause on the number or diversity of SLM architectures evaluated to better contextualize the scope of the reported disparities.
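
The first major comment asks for statistical support for the degradation claim. One common way to provide it is a paired percentile bootstrap over benchmark items, sketched below. This is an illustration under stated assumptions, not a method the paper is known to use: the function `paired_bootstrap_ci`, the pairing of one basic score with one temporally constrained score per item, and the placeholder scores are all hypothetical.

```python
import random


def paired_bootstrap_ci(basic, temporal, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean paired drop (basic - temporal).

    Hypothetical helper: basic[i] and temporal[i] are assumed to be scores
    for the same item or model, so the differences are paired.
    """
    if not basic or len(basic) != len(temporal):
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = [b - t for b, t in zip(basic, temporal)]
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    low = means[int(n_resamples * alpha / 2)]
    high = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return sum(diffs) / n, low, high


# Placeholder scores for illustration only; these are NOT results from the paper.
basic_scores = [0.9, 0.8, 1.0, 0.7, 0.9, 0.8]
temporal_scores = [0.5, 0.4, 0.6, 0.5, 0.4, 0.3]
mean_drop, ci_low, ci_high = paired_bootstrap_ci(basic_scores, temporal_scores)
print(f"mean drop {mean_drop:.2f}, 95% interval [{ci_low:.2f}, {ci_high:.2f}]")
```

An interval that excludes zero would indicate that the drop is unlikely to be resampling noise, although it would not address the second major comment about whether the tasks reflect real conversation.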

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and robustness of our work. Below, we provide point-by-point responses to the major comments and describe the revisions made to the manuscript.

Referee: [Abstract] The central claim that 'nearly all models degrade substantially under temporal constraints' is stated without any details on model selection, exact task definitions, metrics, statistical tests, or controls. This absence makes the performance disparity and degradation claims unverifiable and is load-bearing for the paper's main contribution.

Authors: We agree that the abstract, in its current form, does not include sufficient details to make the central claims fully verifiable on its own. The full manuscript contains these specifics in the benchmark design and experimental sections. To address this directly, we have revised the abstract to briefly include information on the diverse SLM architectures evaluated, the distinction between basic instruction-following tasks and advanced tasks involving temporal constraints such as tempo adherence and synchronized responses, the primary metrics used (including accuracy and deviation measures), and reference to statistical comparisons confirming the observed degradation. This revision ensures the claims are more self-contained and verifiable while respecting abstract length limits.
revision: yes

Referee: [Abstract] The assumption that Game-Time tasks and metrics accurately reflect real-world conversational temporal dynamics is not supported by any validation, such as human baselines or comparison to real conversational data. Without this, it remains possible that observed degradations reflect benchmark artifacts rather than genuine limitations in time awareness.

Authors: This is a valid observation. The tasks are motivated by human language acquisition activities to target core temporal elements of conversation, but the manuscript does not include direct human performance baselines or explicit comparisons to real-world dialogue data. In the revision, we have updated the abstract to clarify the benchmark's purpose as a controlled testbed for temporal capabilities and added a discussion of design rationale along with an explicit acknowledgment of this as a limitation. We maintain that the substantial degradations under temporal constraints highlight genuine weaknesses in current SLMs' time awareness, but we agree that additional validation would further strengthen the claims.
revision: partial

Circularity Check

0 steps flagged

No circularity in new benchmark evaluation

The abstract introduces the Game-Time Benchmark as a novel framework for assessing temporal dynamics in SLMs, consisting of basic instruction-following tasks and advanced tasks with temporal constraints such as tempo adherence and synchronized responses. The central empirical finding—that nearly all models degrade under temporal constraints—is presented as a direct measurement on these newly defined tasks rather than a derivation that reduces to prior fitted parameters, self-citations, or self-referential definitions by construction. No equations, ansatzes, uniqueness theorems, or load-bearing self-citations appear in the provided text, so the evaluation remains self-contained against the external benchmark without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits visibility into parameters or assumptions; no free parameters, new entities, or ad-hoc axioms are explicitly introduced in the summary.

axioms (1)

domain assumption: Instruction-following tasks can serve as a valid proxy for measuring temporal dynamics in conversational models.
The benchmark structure relies on this to separate basic and advanced performance.

pith-pipeline@v0.9.0 · 5717 in / 1132 out tokens · 37479 ms · 2026-05-18T11:45:20.002217+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArrowOfTime.lean · arrow_from_z · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Game-Time consists of basic instruction-following tasks and advanced tasks with temporal constraints, such as tempo adherence and synchronized responses.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: nearly all models degrade substantially under temporal constraints, exposing persistent weaknesses in time awareness and full-duplex interaction

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TiCo: Time-Controllable Spoken Dialogue Model
cs.CL · 2026-03 · unverdicted · novelty 7.0

TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering
eess.AS · 2026-01 · unverdicted · novelty 7.0

AQUA-Bench evaluates audio QA models on three unanswerability scenarios: missing correct answers, mismatched choice sets, and questions irrelevant to the audio.

Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models
cs.CL · 2025-12 · accept · novelty 7.0

Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.

ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
cs.CL · 2026-04 · unverdicted · novelty 6.0

ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 4 Pith papers · 11 internal anchors

[1]

Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

INTRODUCTION In the pursuit of human-like conversation with machines, the research frontier is moving beyond text-based Large Language Models (LLMs). The next challenge lies in mastering conversational dynamics in real-time speech, which has given rise to the field of conversational Spoken Language Models (SLMs) [1, 2, 3, 4, 5, 6, 7]. This marks a c...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

RELATED WORKS 2.1. Full-duplex Spoken Language Models Recent work has explored how SLMs can move beyond turn-based interaction toward full-duplex conversation [12, 22, 23, 24, 25, 26], where listening and speaking occur simultaneously. Two main modeling strategies have emerged to achieve full-duplex capability [3]: (1) Dual-channel SLMs [22, 17, 27, 28] ...

work page
[3]

Please count from one to ten in 10 seconds

GAME-TIME BENCHMARK We introduce the Game-Time Benchmark to evaluate SLMs on their understanding of time, tempo, and timely simultaneously speaking. In this section, we define the task families, describe how the benchmark is constructed, and outline the evaluation protocol. 3.1. Task Families Inspired by how humans learn a language with language activities a...

work page
[4]

Open-Ended

EXPERIMENTAL SETUP We evaluate various SLMs on the Game-Time Benchmark with different full-duplex strategies (see Table 2). This includes Time-Multiplexing models (Freeze-Omni [19], Unmute [18]) which use a modular pipeline of a streaming encoder, a frozen LLM, and a streaming decoder; and a Dual-channel model (Moshi [17]) where a fine-tuned LLM directly pro...

work page
[5]

Main Results Basic Tasks: As shown in Fig

RESULTS 5.1. Main Results Basic Tasks: As shown in Fig. 3 (Top), the oracle topline consistently achieves the best performance across all tasks. GPT-realtime shows strong performance on most Basic Tasks, and it is worth noting that in Repeat, it is the only model that delivers reasonable performance. On the other hand, we observe that time-multiplexin...

work page
[6]

We evaluated various SLMs with a series of tasks testing temporal capabilities of timing, tempo, and simultaneous speaking

CONCLUSION This paper introduced the Game-Time Benchmark to address a critical gap in the evaluation of the temporal dynamics of conversational Spoken Language Models (SLMs). We evaluated various SLMs with a series of tasks testing temporal capabilities of timing, tempo, and simultaneous speaking. Our results reveal a clear gap, with some models able to...

work page
[7]

ACKNOWLEDGMENT We are grateful to Yi-Cheng Lin and Cheng-Han Chiang for their valuable discussions on evaluation methods, and to Shih-Yun Shan Kuan for assistance with commercial API usage

work page
[8]

WavChat: A survey of spoken dialogue models

Shengpeng Ji et al., “Wavchat: A survey of spoken dialogue models,” arXiv preprint arXiv:2411.13577, 2024

work page arXiv 2024
[9]

Recent advances in speech language models: A survey

Wenqian Cui et al., “Recent advances in speech language models: A survey,” in ACL (1), 2025, pp. 13943–13970, Association for Computational Linguistics

work page 2025
[10]

On The Landscape of Spoken Language Models: A Comprehensive Survey

Siddhant Arora et al., “On the landscape of spoken language models: A comprehensive survey,” arXiv preprint arXiv:2504.08528, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Towards audio language modeling–an overview

Haibin Wu et al., “Towards audio language modeling–an overview,” arXiv preprint arXiv:2402.13236, 2024

work page arXiv 2024
[12]

Sparks of large audio models: A survey and outlook

Siddique Latif et al., “Sparks of large audio models: A survey and outlook,” arXiv preprint arXiv:2308.12792, 2023

work page arXiv 2023
[13]

Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

Ke Hu et al., “Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model,” in Interspeech 2025, 2025, pp. 2715–2719

work page 2025
[14]

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Chaoyou Fu et al., “Vita-1.5: Towards gpt-4o level real-time vision and speech interaction,” arXiv preprint arXiv:2501.01957, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Language model can listen while speaking

Ziyang Ma et al., “Language model can listen while speaking,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025, vol. 39, pp. 24831–24839

work page 2025
[16]

MinMo: A multimodal large language model for seamless voice interaction

Qian Chen, Yafeng Chen, Yanni Chen, et al., “MinMo: A multimodal large language model for seamless voice interaction,” arXiv preprint arXiv:2501.06282, 2025

work page arXiv 2025
[17]

URO-Bench: A comprehensive benchmark for end-to-end spoken dialogue models

Ruiqi Yan et al., “URO-Bench: A comprehensive benchmark for end-to-end spoken dialogue models,” arXiv preprint arXiv:2502.17810, 2025

work page arXiv 2025
[18]

Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

Chih-Kai Yang et al., “Towards holistic evaluation of large audio-language models: A comprehensive survey,” arXiv preprint arXiv:2505.15957, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities

Guan-Ting Lin et al., “Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities,” arXiv preprint arXiv:2503.04721, 2025

work page arXiv 2025
[20]

Talking turns: Benchmarking audio foundation models on turn-taking dynamics

Siddhant Arora et al., “Talking turns: Benchmarking audio foundation models on turn-taking dynamics,” in The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[21]

Child’s talk: Learning to use language

Jerome Bruner, “Child’s talk: Learning to use language,” Child Language Teaching and Therapy, vol. 1, pp. 111–114, 1985

work page 1985
[22]

Games, social exchange and the acquisition of language

Nancy Ratner and Jerome Bruner, “Games, social exchange and the acquisition of language,” Journal of child language, vol. 5, no. 3, pp. 391–401, 1978

work page 1978
[23]

David Whitebread et al., The role of play in children’s development: A review of the evidence, LEGO Fonden Billund, Denmark, 2017

work page 2017
[24]

Moshi: a speech-text foundation model for real-time dialogue

Défossez et al., “Moshi: a speech-text foundation model for real-time dialogue,” arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Streaming sequence-to-sequence learning with delayed streams modeling

Neil Zeghidour et al., “Streaming sequence-to-sequence learning with delayed streams modeling,” arXiv preprint arXiv:2509.08753, 2025

work page arXiv 2025
[26]

Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm

Xiong Wang et al., “Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm,” in Forty-second International Conference on Machine Learning, 2025

work page 2025
[27]

Gemini live: A more helpful, natural and visual assistant

Google, “Gemini live: A more helpful, natural and visual assistant,” Aug. 2025

work page 2025
[28]

Introducing gpt-realtime and realtime api updates for production voice agents

OpenAI, “Introducing gpt-realtime and realtime api updates for production voice agents,” Aug. 2025

work page 2025
[29]

Generative spoken dialogue language modeling

Tu Anh Nguyen et al., “Generative spoken dialogue language modeling,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 250–266, 2023

work page 2023
[30]

Reinforcement learning enhanced full-duplex spoken dialogue language models for conversational interactions

Chen Chen, Ke Hu, Chao-Han Huck Yang, Ankita Pasad, et al., “Reinforcement learning enhanced full-duplex spoken dialogue language models for conversational interactions,” in Second Conference on Language Modeling, 2025

work page 2025
[31]

Salmonn-omni: A standalone speech llm without codec injection for full-duplex conversation

Wenyi Yu, Siyin Wang, et al., “Salmonn-omni: A standalone speech llm without codec injection for full-duplex conversation,” arXiv preprint arXiv:2505.17060, 2025

work page arXiv 2025
[32]

Qwen2.5-Omni Technical Report

Jin Xu et al., “Qwen2.5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

OmniFlatten: An end-to-end GPT model for seamless voice conversation

Qinglin Zhang et al., “OmniFlatten: An end-to-end GPT model for seamless voice conversation,” in Proc. ACL, 2025

work page 2025
[34]

NTPP: Generative speech language modeling for dual-channel spoken dialogue via next-token-pair prediction

Qichao Wang et al., “NTPP: Generative speech language modeling for dual-channel spoken dialogue via next-token-pair prediction,” in Forty-second ICML, 2025

work page 2025
[35]

Aligning spoken dialogue models from user interactions

Anne Wu et al., “Aligning spoken dialogue models from user interactions,” in Forty-second ICML, 2025

work page 2025
[36]

Beyond the turn-based game: Enabling real-time conversations with duplex models

Xinrong Zhang et al., “Beyond the turn-based game: Enabling real-time conversations with duplex models,” arXiv preprint arXiv:2406.15718, 2024

work page arXiv 2024
[37]

A full-duplex speech dialogue scheme based on large language model

Peng Wang et al., “A full-duplex speech dialogue scheme based on large language model,” Advances in Neural Information Processing Systems, vol. 37, pp. 13372–13403, 2024

work page 2024
[38]

SD-Eval: A benchmark dataset for spoken dialogue understanding beyond words

Junyi Ao, Yuancheng Wang, et al., “SD-Eval: A benchmark dataset for spoken dialogue understanding beyond words,” Advances in Neural Information Processing Systems, 2024

work page 2024
[39]

VoiceBench: Benchmarking LLM-Based Voice Assistants

Yiming Chen et al., “Voicebench: Benchmarking llm-based voice assistants,” arXiv preprint arXiv:2410.17196, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Benchmarking open-ended audio dialogue understanding for large audio-language models

Kuofeng Gao et al., “Benchmarking open-ended audio dialogue understanding for large audio-language models,” in ACL (1), 2025, Association for Computational Linguistics

work page 2025
[41]

Vocalbench: Benchmarking the vocal conversational abilities for speech interaction models

Heyang Liu, Yuhao Wang, et al., “Vocalbench: Benchmarking the vocal conversational abilities for speech interaction models,” arXiv preprint arXiv:2505.15727, 2025

work page arXiv 2025
[42]

Wildspeech-bench: Benchmarking end-to-end speechllms in the wild

Jian Zhang et al., “Wildspeech-bench: Benchmarking audio llms in natural speech conversation,” arXiv preprint arXiv:2506.21875, 2025

work page arXiv 2025
[43]

FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems

Yizhou Peng et al., “FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems,” in Interspeech 2025, 2025, pp. 176–180

work page 2025
[44]

Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou et al., “Instruction-following evaluation for large language models,” arXiv preprint arXiv:2311.07911, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Generalizing Verifiable Instruction Following

Valentina Pyatkin et al., “Generalizing verifiable instruction following,” arXiv preprint arXiv:2507.02833, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Zhihao Du et al., “Cosyvoice 2: Scalable streaming speech synthesis with large language models,” arXiv preprint arXiv:2412.10117, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Audio-aware large language models as judges for speaking styles

Cheng-Han Chiang et al., “Audio-aware large language models as judges for speaking styles,” arXiv preprint arXiv:2506.05984, 2025

work page arXiv 2025
[48]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[1] [1]

Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

INTRODUCTION In the pursuit of human-like conversation with machines, the re- search frontier is moving beyond text-based Large Language Models (LLMs). The next challenge lies in mastering conversational dynam- ics in real-time speech, which has given rise to the field of conver- sational Spoken Language Models (SLMs) [1, 2, 3, 4, 5, 6, 7]. This marks a c...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

RELATED WORKS 2.1. Full-duplex Spoken Language Models Recent work has explored how SLMs can move beyond turn-based interaction toward full-duplex conversation [12, 22, 23, 24, 25, 26], where listening and speaking occur simultaneously. Two main mod- eling strategies have emerged to achieve full-duplex capability [3]: (1) Dual-channel SLMs[22, 17, 27, 28] ...

work page

[3] [3]

Please count from one to ten in10seconds

GAME-TIME BENCHMARK We introduce theGame-Time Benchmarkto evaluate SLMs on their understanding oftime,tempo, and timelysimultaneously speaking. In this section, we define the task families, describe how the bench- mark is constructed, and outline the evaluation protocol. 3.1. Task Families Inspired by how humans learn a language with language activities a...

work page

[4] [4]

Open-Ended

EXPERIMENTAL SETUP We evaluate various SLMs on the Game-Time Benchmark with different full-duplex strategies (see Table 2). This includesTime- Multiplexingmodels (Freeze-Omni [19], Unmute [18]) which use a modular pipeline of a streaming encoder, a frozen LLM, and a streaming decoder; and aDual-channelmodel (Moshi [17]) where a fine-tuned LLM directly pro...

work page

[5] [5]

Main Results Basic Tasks:As shown in Fig

RESULTS 5.1. Main Results Basic Tasks:As shown in Fig. 3 (Top), the oracle topline consis- tently achieves the best performance across all tasks. GPT-realtime shows strong performance on most Basic Tasks, and it is worth not- ing that inRepeat, it is the only model that delivers reasonable per- formance. On the other hand, we observe that time-multiplexin...

work page

[6] [6]

We evaluated various SLMs with a series of tasks testing temporal capabilities of timing, tempo, and simultaneous speaking

CONCLUSION This paper introduced the Game-Time Benchmark to address a criti- cal gap in the evaluation of the temporal dynamics of conversational Spoken Language Models (SLMs). We evaluated various SLMs with a series of tasks testing temporal capabilities of timing, tempo, and simultaneous speaking. Our results reveal a clear gap, with some models able to...

work page

[7] [7]

ACKNOWLEDGMENT We are grateful to Yi-Cheng Lin and Cheng-Han Chiang for their valuable discussions on evaluation methods, and to Shih-Yun Shan Kuan for assistance with commercial API usage

work page

[8] [8]

WavChat: A survey of spoken dialogue models.arXiv preprint arXiv:2411.13577,

Shengpeng Ji et al., “Wavchat: A survey of spoken dialogue models,”arXiv preprint arXiv:2411.13577, 2024

work page arXiv 2024

[9] [9]

Recent advances in speech language mod- els: A survey,

Wenqian Cui et al., “Recent advances in speech language mod- els: A survey,” inACL (1). 2025, pp. 13943–13970, Associa- tion for Computational Linguistics

work page 2025

[10] [10]

On The Landscape of Spoken Language Models: A Comprehensive Survey

Siddhant Arora et al., “On the landscape of spoken lan- guage models: A comprehensive survey,”arXiv preprint arXiv:2504.08528, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Felix Wu, Kwangyoun Kim, Shinji Watanabe, Kyu J

Haibin Wu et al., “Towards audio language modeling–an overview,”arXiv preprint arXiv:2402.13236, 2024

work page arXiv 2024

[12] [12]

Schuller

Siddique Latif et al., “Sparks of large audio models: A survey and outlook,”arXiv preprint arXiv:2308.12792, 2023

work page arXiv 2023

[13] [13]

Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model,

Ke Hu et al., “Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model,” inInterspeech 2025, 2025, pp. 2715–2719

work page 2025

[14] Chaoyou Fu et al., “VITA-1.5: Towards GPT-4o level real-time vision and speech interaction,” arXiv preprint arXiv:2501.01957, 2025.

[15] Ziyang Ma et al., “Language model can listen while speaking,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025, vol. 39, pp. 24831–24839.

[16] Qian Chen, Yafeng Chen, Yanni Chen, et al., “MinMo: A multimodal large language model for seamless voice interaction,” arXiv preprint arXiv:2501.06282, 2025.

[17] Ruiqi Yan et al., “URO-Bench: A comprehensive benchmark for end-to-end spoken dialogue models,” arXiv preprint arXiv:2502.17810, 2025.

[18] Chih-Kai Yang et al., “Towards holistic evaluation of large audio-language models: A comprehensive survey,” arXiv preprint arXiv:2505.15957, 2025.

[19] Guan-Ting Lin et al., “Full-Duplex-Bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities,” arXiv preprint arXiv:2503.04721, 2025.

[20] Siddhant Arora et al., “Talking turns: Benchmarking audio foundation models on turn-taking dynamics,” in The Thirteenth International Conference on Learning Representations, 2025.

[21] Jerome Bruner, “Child’s talk: Learning to use language,” Child Language Teaching and Therapy, vol. 1, pp. 111–114, 1985.

[22] Nancy Ratner and Jerome Bruner, “Games, social exchange and the acquisition of language,” Journal of Child Language, vol. 5, no. 3, pp. 391–401, 1978.

[23] David Whitebread et al., The role of play in children’s development: A review of the evidence, LEGO Fonden, Billund, Denmark, 2017.

[24] Défossez et al., “Moshi: a speech-text foundation model for real-time dialogue,” arXiv preprint arXiv:2410.00037, 2024.

[25] Neil Zeghidour et al., “Streaming sequence-to-sequence learning with delayed streams modeling,” arXiv preprint arXiv:2509.08753, 2025.

[26] Xiong Wang et al., “Freeze-Omni: A smart and low latency speech-to-speech dialogue model with frozen LLM,” in Forty-second International Conference on Machine Learning, 2025.

[27] Google, “Gemini Live: A more helpful, natural and visual assistant,” Aug. 2025.

[28] OpenAI, “Introducing gpt-realtime and Realtime API updates for production voice agents,” Aug. 2025.

[29] Tu Anh Nguyen et al., “Generative spoken dialogue language modeling,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 250–266, 2023.

[30] Chen Chen, Ke Hu, Chao-Han Huck Yang, Ankita Pasad, et al., “Reinforcement learning enhanced full-duplex spoken dialogue language models for conversational interactions,” in Second Conference on Language Modeling, 2025.

[31] Wenyi Yu, Siyin Wang, et al., “SALMONN-omni: A standalone speech LLM without codec injection for full-duplex conversation,” arXiv preprint arXiv:2505.17060, 2025.

[32] Jin Xu et al., “Qwen2.5-Omni technical report,” arXiv preprint arXiv:2503.20215, 2025.

[33] Qinglin Zhang et al., “OmniFlatten: An end-to-end GPT model for seamless voice conversation,” in Proc. ACL, 2025.

[34] Qichao Wang et al., “NTPP: Generative speech language modeling for dual-channel spoken dialogue via next-token-pair prediction,” in Forty-second ICML, 2025.

[35] Anne Wu et al., “Aligning spoken dialogue models from user interactions,” in Forty-second ICML, 2025.

[36] Xinrong Zhang et al., “Beyond the turn-based game: Enabling real-time conversations with duplex models,” arXiv preprint arXiv:2406.15718, 2024.

[37] Peng Wang et al., “A full-duplex speech dialogue scheme based on large language model,” Advances in Neural Information Processing Systems, vol. 37, pp. 13372–13403, 2024.

[38] Junyi Ao, Yuancheng Wang, et al., “SD-Eval: A benchmark dataset for spoken dialogue understanding beyond words,” Advances in Neural Information Processing Systems, 2024.

[39] Yiming Chen et al., “VoiceBench: Benchmarking LLM-based voice assistants,” arXiv preprint arXiv:2410.17196, 2024.

[40] Kuofeng Gao et al., “Benchmarking open-ended audio dialogue understanding for large audio-language models,” in ACL (1), 2025, Association for Computational Linguistics.

[41] Heyang Liu, Yuhao Wang, et al., “VocalBench: Benchmarking the vocal conversational abilities for speech interaction models,” arXiv preprint arXiv:2505.15727, 2025.

[42] Jian Zhang et al., “WildSpeech-Bench: Benchmarking audio LLMs in natural speech conversation,” arXiv preprint arXiv:2506.21875, 2025.

[43] Yizhou Peng et al., “FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems,” in Interspeech 2025, 2025, pp. 176–180.

[44] Jeffrey Zhou et al., “Instruction-following evaluation for large language models,” arXiv preprint arXiv:2311.07911, 2023.

[45] Valentina Pyatkin et al., “Generalizing verifiable instruction following,” arXiv preprint arXiv:2507.02833, 2025.

[46] Zhihao Du et al., “CosyVoice 2: Scalable streaming speech synthesis with large language models,” arXiv preprint arXiv:2412.10117, 2024.

[47] Cheng-Han Chiang et al., “Audio-aware large language models as judges for speaking styles,” arXiv preprint arXiv:2506.05984, 2025.

[48] Gheorghe Comanici et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025.