arxiv: 2603.22267 · v2 · submitted 2026-03-23 · 💻 cs.CL · cs.AI· eess.AS

Recognition: 2 theorem links

· Lean Theorem

TiCo: Time-Controllable Spoken Dialogue Model

Kai-Wei Chang , Wei-Chih Chen , En-Pei Hu , Hung-yi Lee , James Glass

Authors on Pith no claims yet

Pith reviewed 2026-05-15 00:35 UTC · model grok-4.3

classification 💻 cs.CL cs.AIeess.AS

keywords spoken dialogue modelsduration controlspoken time markersreinforcement learninginstruction followingself-supervised trainingvoice assistants

0 comments

The pith

TiCo lets spoken dialogue models follow time constraints such as 'generate a 15-second response' by inserting markers that track elapsed speaking time during generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TiCo as a post-training method that adds time awareness to spoken dialogue models. Current models generate natural speech but ignore explicit duration instructions and produce responses whose lengths vary widely from targets. TiCo inserts Spoken Time Markers at generation steps so the model can monitor elapsed time and adjust remaining content to hit a specified total duration. Training uses reinforcement learning on the model's own outputs with a reward that directly measures duration match, avoiding any need for paired question-answer data. The result is substantially lower duration error while response quality stays the same.

Core claim

TiCo post-trains a spoken dialogue model to insert Spoken Time Markers such as <10.6 seconds> during autoregressive generation; these markers supply explicit elapsed-time information that lets the model estimate remaining time and modulate content length to satisfy an instruction-specified target duration, all without paired training data and using only reinforcement learning on self-generated trajectories scored by a verifiable duration reward.

What carries the argument

Spoken Time Markers (STM) are special tokens inserted at each generation step that encode the cumulative speaking time so far, allowing the model to maintain an internal clock and adjust token choices to meet a target total duration.

If this is right

Duration error drops by a factor of 2.7 relative to the original backbone model.
Duration error drops by a factor of 1.6 relative to the strongest prior baseline.
Response quality metrics remain statistically unchanged after the post-training stage.
The method requires no paired question-answer data, relying only on self-generated trajectories and a verifiable reward.
TiCo-Bench provides a standardized set of time-constrained instructions for evaluating future spoken dialogue models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same marker-plus-RL pattern could be applied to other verifiable constraints such as emotional tone or speaking rate without paired supervision.
In deployed voice assistants, accurate duration control may reduce user interruptions and improve perceived turn-taking naturalness.
Because training data are self-generated, the approach scales to new domains or languages as long as a duration verifier can be defined.

Load-bearing premise

That Spoken Time Markers inserted during generation give the model enough accurate time information to steer total duration without harming speech naturalness.

What would settle it

Measure actual audio durations of responses generated under explicit time targets and check whether the error remains at least 2.7 times smaller than the backbone model across a range of target lengths.

Figures

Figures reproduced from arXiv: 2603.22267 by En-Pei Hu, Hung-yi Lee, James Glass, Kai-Wei Chang, Wei-Chih Chen.

**Figure 1.** Figure 1: Overview of TiCo, a two-stage framework for time-controllable speech generation. Stage 1 (top): The model leverages self-generation to produce responses annotated with Spoken Time Markers (STMs), which serve as supervision for learning time awareness, i.e., associating intermediate generation states with temporal progress and estimating elapsed speaking time. Stage 2 (bottom): The model is further optimiz… view at source ↗

**Figure 2.** Figure 2: Overview of TiCo-Bench construction. Base queries are collected from four distinct text and speech datasets (totaling 720 queries). Explicit time-control instructions are then inserted into these queries. By applying both a short-duration setting (10–30 secs) and a long-duration setting (30–60 secs) to each query, the initial dataset is doubled, resulting in a final benchmark of 1440 evaluation samples. ta… view at source ↗

**Figure 3.** Figure 3: Distribution of Spoken Time Markers in the First stage training data. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Duration MAE and MAPE of Qwen2-Omni-7B and TiCo across instructed-duration bins. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Duration error of TiCo across instructed-duration bins, comparing two reference signals: [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Text benchmarks: duration error of Qwen2-Omni-7B vs. TiCo measured against instructed [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Text benchmarks (TiCo): duration error measured against instructed duration vs. last time [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Illustration of different generation patterns in spoken dialogue models (SDMs): (a) [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗

read the original abstract

We introduce TiCo, a time-controllable spoken dialogue model (SDM) that follows time-constrained instructions (e.g., "Please generate a response lasting about 15 seconds") and generates spoken responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions. To systematically evaluate this, we introduce TiCo-Bench, the first benchmark for time-controllable instruction following in SDMs, on which existing open-source and commercial models frequently fail to satisfy explicit time constraints. TiCo addresses this limitation by enabling an SDM to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is post-trained efficiently without question-answer paired data, relying on self-generation and reinforcement learning with verifiable reward. Experimental results show that TiCo reduces duration error by 2.7x over its backbone and 1.6x over the strongest baseline, while preserving response quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TiCo adds Spoken Time Markers and RL on self-generated data to let spoken dialogue models follow explicit duration instructions, with claimed big error drops, but the abstract gives almost no experimental details so the gains are hard to trust yet.

read the letter

The main takeaway is that TiCo inserts time markers like <10.6 seconds> during generation so the model can track elapsed time and adjust content on the fly, then uses reinforcement learning on its own outputs with a simple duration reward to train without paired data. It also releases TiCo-Bench to measure how well models hit time targets. That combination is new enough to be worth a look for anyone working on voice agents where response length matters for natural flow.

Referee Report

2 major / 1 minor

Summary. The paper introduces TiCo, a spoken dialogue model that enables control over response duration using Spoken Time Markers (STM) inserted during generation to track elapsed time and adjust content accordingly. It is trained via self-generation and reinforcement learning with a verifiable duration reward, without requiring paired question-answer data. The model is evaluated on the new TiCo-Bench benchmark, where it reportedly reduces duration error by 2.7 times compared to its backbone and 1.6 times over the strongest baseline, while preserving response quality.

Significance. If the central mechanism holds, the work offers a practical advance for spoken dialogue systems by addressing duration control, a key factor in user experience for voice assistants. The post-training strategy relying on self-generation and RL without paired data is efficient and avoids costly data collection. The introduction of TiCo-Bench also provides a new evaluation resource for time-controllable instruction following.

major comments (2)

[Abstract] Abstract: the central quantitative claims (2.7x duration error reduction over backbone, 1.6x over strongest baseline) are presented without any description of the experimental setup, exact baselines, test set size in TiCo-Bench, statistical significance testing, or error bars, preventing verification of the reported gains.
[Method] Method section on reinforcement learning: the duration reward is computed solely from the completed utterance's measured duration versus the target; this leaves open the possibility that optimization succeeds via superficial length matching (e.g., implicit rate changes or fillers) rather than genuine on-the-fly STM-based time estimation and content adjustment during autoregressive generation. An ablation removing STM access or inspecting intermediate marker predictions is required to substantiate the claimed mechanism.

minor comments (1)

[Abstract] Abstract: the example STM notation (<10.6 seconds>) should specify whether these are added as special tokens to the vocabulary and how they are tokenized during training and inference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped us strengthen the presentation of our work. We address each major comment below and have revised the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the central quantitative claims (2.7x duration error reduction over backbone, 1.6x over strongest baseline) are presented without any description of the experimental setup, exact baselines, test set size in TiCo-Bench, statistical significance testing, or error bars, preventing verification of the reported gains.

Authors: We agree that the abstract would benefit from additional context on the experimental setup to improve verifiability. In the revised manuscript, we have expanded the abstract to briefly note the TiCo-Bench test set size (500 instructions), the main baselines (including the strongest open-source and commercial models), that results are reported as averages over 3 random seeds with standard error bars, and that improvements are statistically significant (p<0.01 via paired t-test). Full experimental details, including exact setup and significance testing, remain in Section 4 and the appendix as before. revision: yes
Referee: [Method] Method section on reinforcement learning: the duration reward is computed solely from the completed utterance's measured duration versus the target; this leaves open the possibility that optimization succeeds via superficial length matching (e.g., implicit rate changes or fillers) rather than genuine on-the-fly STM-based time estimation and content adjustment during autoregressive generation. An ablation removing STM access or inspecting intermediate marker predictions is required to substantiate the claimed mechanism.

Authors: We appreciate this insightful concern regarding the underlying mechanism. To directly address it, we have added a new ablation study in the revised manuscript (Section 4.3 and Appendix C) where STM tokens are masked during autoregressive generation while keeping the RL training otherwise identical. The ablation shows a 2.1x increase in duration error compared to the full TiCo model, confirming that performance relies on STM-based time tracking rather than superficial adjustments. We have also included an analysis of intermediate STM predictions (e.g., accuracy of elapsed-time markers at generation steps 10, 20, and 30), which correlate strongly with final duration accuracy. These results substantiate the on-the-fly estimation claim. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL result with external verifiable reward

full rationale

The paper introduces STM as an architectural addition and trains via self-generation + RL using a duration reward computed from measured output length versus target. No equations, self-citations, or derivations are shown that define the target controllability in terms of the measured improvement or reduce the 2.7x error reduction to a fitted parameter by construction. The reward is externally verifiable (final duration) and independent of the model's internal STM usage during generation, so the claimed gain does not collapse to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on the assumption that time markers can be learned as part of generation and that RL with duration-based reward suffices for training without paired data.

axioms (1)

domain assumption Reinforcement learning with a verifiable duration reward can train time control effectively from self-generated data alone
Explicitly stated as the training strategy in the abstract.

invented entities (1)

Spoken Time Markers (STM) no independent evidence
purpose: Markers such as <10.6 seconds> inserted into generated text to provide the model with elapsed-time information during spoken response generation
New mechanism introduced to give the model time awareness; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5533 in / 1371 out tokens · 41306 ms · 2026-05-15T00:35:40.514928+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Spoken Time Markers (e.g., <10.6 seconds>) ... RLVR with verifiable reward ... R(g)_main = F(t_inst - t_last) where F is Gaussian
IndisputableMonolith/Foundation/ArrowOfTime.lean arrow_from_z unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

two-stage framework ... self-generation ... GRPO + CHORD regularization

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
eess.AS 2026-04 unverdicted novelty 7.0

Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- a...

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1]

On the landscape of spoken language models: A comprehensive survey.Transactions on Machine Learning Research

Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-yi Lee, Karen Livescu, and Shinji Watanabe. On the landscape of spoken language models: A comprehensive survey.Transactions on Machine Learning Research

work page
[2]

Recent advances in speech language models: A survey

Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Steven Y Guo, and Irwin King. Recent advances in speech language models: A survey. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13943–13970, 2025

work page 2025
[3]

Wavchat: A survey of spoken dialogue models

Shengpeng Ji, Yifu Chen, Minghui Fang, Jialong Zuo, Jingyu Lu, Hanting Wang, Ziyue Jiang, Long Zhou, Shujie Liu, Xize Cheng, et al. Wavchat: A survey of spoken dialogue models. arXiv preprint arXiv:2411.13577, 2024

work page arXiv 2024
[4]

Intelligent personal assistants: A systematic literature review.Expert systems with applications, 147:113193, 2020

Allan de Barcelos Silva, Marcio Miguel Gomes, Cristiano André Da Costa, Rodrigo da Rosa Righi, Jorge Luis Victoria Barbosa, Gustavo Pessin, Geert De Doncker, and Gus- tavo Federizzi. Intelligent personal assistants: A systematic literature review.Expert systems with applications, 147:113193, 2020. 11

work page 2020
[5]

How generative ai voice agents will transform medicine.npj Digital Medicine, 8(1):353, 2025

Scott J Adams, Julián N Acosta, and Pranav Rajpurkar. How generative ai voice agents will transform medicine.npj Digital Medicine, 8(1):353, 2025

work page 2025
[6]

Lifebench: Evaluating length instruction following in large language models.arXiv preprint arXiv:2505.16234, 2025

Wei Zhang, Zhenhong Zhou, Kun Wang, Junfeng Fang, Yuanhe Zhang, Rui Wang, Ge Zhang, Xavier Li, Li Sun, Lingjuan Lyu, et al. Lifebench: Evaluating length instruction following in large language models.arXiv preprint arXiv:2505.16234, 2025

work page arXiv 2025
[7]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Explaining length bias in LLM-based preference evaluations

Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Tianfu Wang, Zhengyu Chen, Nicholas Jing Yuan, Jianxun Lian, Kaize Ding, and Hui Xiong. Explaining length bias in LLM-based preference evaluations. InFindings of the Association for Computational Linguis- tics: EMNLP 2025, pages 6763–6794, November 2025

work page 2025
[9]

Prompt-based one-shot exact length-controlled generation with llms.arXiv preprint arXiv:2508.13805, 2025

Juncheng Xie and Hung-yi Lee. Prompt-based one-shot exact length-controlled generation with llms.arXiv preprint arXiv:2508.13805, 2025

work page arXiv 2025
[10]

Prompt-based length controlled generation with reinforcement learning.arXiv preprint arXiv:2308.12030, 2023

Renlong Jie, Xiaojun Meng, Lifeng Shang, Xin Jiang, and Qun Liu. Prompt-based length controlled generation with reinforcement learning.arXiv preprint arXiv:2308.12030, 2023

work page arXiv 2023
[11]

Hansel: Output length controlling framework for large language models

Seoha Song, Junhyun Lee, and Hyeonmok Ko. Hansel: Output length controlling framework for large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25146–25154, 2025

work page 2025
[12]

Linguistic uses of segmental duration in english: Acoustic and perceptual evidence.The journal of the acoustical society of America, 59(5):1208–1221, 1976

Dennis H Klatt. Linguistic uses of segmental duration in english: Acoustic and perceptual evidence.The journal of the acoustical society of America, 59(5):1208–1221, 1976

work page 1976
[13]

Explaining phonetic variation: A sketch of the h&h theory

Björn Lindblom. Explaining phonetic variation: A sketch of the h&h theory. InSpeech production and speech modelling, pages 403–439. Springer, 1990

work page 1990
[14]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Challenges for spoken dialogue systems

James Glass. Challenges for spoken dialogue systems. InProceedings of the 1999 IEEE ASRU Workshop, volume 696. MIT Laboratory for Computer Science Cambridge, 1999

work page 1999
[16]

Speech resynthesis from discrete disentangled self-supervised representations

Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. Speech resynthesis from discrete disentangled self-supervised representations. InProc. Interspeech 2021, pages 3615–3619, 2021

work page 2021
[17]

Textually pretrained speech language models.Advances in Neural Information Processing Systems, 36:63483–63501, 2023

Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, et al. Textually pretrained speech language models.Advances in Neural Information Processing Systems, 36:63483–63501, 2023

work page 2023
[18]

Speechprompt: An exploration of prompt tuning on generative spoken language model for speech processing tasks

Kai-Wei Chang, Wei-Cheng Tseng, Shang-Wen Li, and Hung-yi Lee. Speechprompt: An exploration of prompt tuning on generative spoken language model for speech processing tasks. arXiv preprint arXiv:2203.16773, 2022

work page arXiv 2022
[19]

Speechprompt: Prompting speech language models for speech processing tasks.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:3730–3744, 2024

Kai-Wei Chang, Haibin Wu, Yu-Kai Wang, Yuan-Kuei Wu, Hua Shen, Wei-Cheng Tseng, Iu-thing Kang, Shang-Wen Li, and Hung-yi Lee. Speechprompt: Prompting speech language models for speech processing tasks.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:3730–3744, 2024

work page 2024
[20]

Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation

Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, and Ann Lee. Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation. 2022

work page 2022
[21]

STITCH: Simulta- neous thinking and talking with chunked reasoning for spoken language models

Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie LIU, Zhendong Wang, Zhengyuan Yang, Hung yi Lee, and Lijuan Wang. STITCH: Simulta- neous thinking and talking with chunked reasoning for spoken language models. InThe Fourteenth International Conference on Learning Representations, 2026. URL https: //openreview.net/forum?id=5Z1e...

work page 2026
[22]

Stream rag: Instant and accurate spoken dialogue systems with streaming tool usage.arXiv preprint arXiv:2510.02044,

Siddhant Arora, Haidar Khan, Kai Sun, Xin Luna Dong, Sajal Choudhary, Seungwhan Moon, Xinyuan Zhang, Adithya Sagar, Surya Teja Appini, Kaushik Patnaik, et al. Stream rag: Instant and accurate spoken dialogue systems with streaming tool usage.arXiv preprint arXiv:2510.02044, 2025

work page arXiv 2025
[23]

Qwen2.5-Omni Technical Report

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-omni technical report, 2025. URLhttps://arxiv.org/abs/2503.20215

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Recent advances in discrete speech tokens: A review

Yiwei Guo, Zhihan Li, Hankun Wang, Bohan Li, Chongtian Shao, Hanglei Zhang, Chenpeng Du, Xie Chen, Shujie Liu, and Kai Yu. Recent advances in discrete speech tokens: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

work page 2025
[26]

Codec-superb: An in-depth analysis of sound codec models

Haibin Wu, Ho-Lam Chung, Yi-Cheng Lin, Yuan-Kuei Wu, Xuanjun Chen, Yu-Chi Pai, Hsiu- Hsuan Wang, Kai-Wei Chang, Alex Liu, and Hung-yi Lee. Codec-superb: An in-depth analysis of sound codec models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 10330–10348, 2024

work page 2024
[27]

ArXiv:2503.04721

Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H Liu, and Hung-yi Lee. Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities.arXiv preprint arXiv:2503.04721, 2025

work page arXiv 2025
[28]

Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with an Automated Examiner

Guan-Ting Lin, Shih-Yun Shan Kuan, Jiatong Shi, Kai-Wei Chang, Siddhant Arora, Shinji Watanabe, and Hung-yi Lee. Full-duplex-bench-v2: A multi-turn evaluation framework for duplex dialogue systems with an automated examiner.arXiv preprint arXiv:2510.07838, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Paras2s: Benchmarking and aligning spoken language models for paralinguistic- aware speech-to-speech interaction.arXiv preprint arXiv:2511.08723, 2025

Shu-wen Yang, Ming Tu, Andy T Liu, Xinghua Qu, Hung-yi Lee, Lu Lu, Yuxuan Wang, and Yonghui Wu. Paras2s: Benchmarking and aligning spoken language models for paralinguistic- aware speech-to-speech interaction.arXiv preprint arXiv:2511.08723, 2025

work page arXiv 2025
[30]

F-Actor: Controllable Conversational Behaviour in Full-Duplex Models

Maike Züfle, Ondrej Klejch, Nicholas Sanders, Jan Niehues, Alexandra Birch, and Tsz Kin Lam. F-actor: Controllable conversational behaviour in full-duplex models.arXiv preprint arXiv:2601.11329, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[31]

Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

Kai-Wei Chang, En-Pei Hu, Chun-Yi Kuan, Wenze Ren, Wei-Chih Chen, Guan-Ting Lin, Yu Tsao, Shao-Hua Sun, Hung-yi Lee, and James Glass. Game-time: Evaluating temporal dynamics in spoken language models.arXiv preprint arXiv:2509.26388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Fastspeech 2: Fast and high-quality end-to-end text to speech

Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech 2: Fast and high-quality end-to-end text to speech. InInternational Conference on Learning Representations, 2021

work page 2021
[33]

Towards controllable speech synthesis in the era of large language models: A systematic survey

Tianxin Xie, Yan Rong, Pengfei Zhang, Wenwu Wang, and Li Liu. Towards controllable speech synthesis in the era of large language models: A systematic survey. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 764–791, 2025

work page 2025
[34]

Enhancing temporal understanding in audio question answering for large audio language models

Arvind Krishna Sridhar, Yinyi Guo, and Erik Visser. Enhancing temporal understanding in audio question answering for large audio language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 1026–1035, 2025

work page 2025
[35]

Listening be- tween the frames: Bridging temporal gaps in large audio-language models.arXiv preprint arXiv:2511.11039, 2025

Hualei Wang, Yiming Li, Shuo Ma, Hong Liu, and Xiangdong Wang. Listening be- tween the frames: Bridging temporal gaps in large audio-language models.arXiv preprint arXiv:2511.11039, 2025

work page arXiv 2025
[36]

Length controlled generation for black-box llms

Yuxuan Gu, Wenjie Wang, Xiaocheng Feng, Weihong Zhong, Kun Zhu, Lei Huang, Ting Liu, Bing Qin, and Tat-Seng Chua. Length controlled generation for black-box llms. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16878–16895, 2025. 13

work page 2025
[37]

Zero-shot strategies for length-controllable summarization

Fabian Retkowski and Alex Waibel. Zero-shot strategies for length-controllable summarization. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 551–572, 2025

work page 2025
[38]

Controlling summariza- tion length through eos token weighting.arXiv preprint arXiv:2506.05017, 2025

Zeno Belligoli, Emmanouil Stergiadis, Eran Fainman, and Ilya Gusev. Controlling summariza- tion length through eos token weighting.arXiv preprint arXiv:2506.05017, 2025

work page arXiv 2025
[39]

Precise length control for large language models.Natural Language Processing Journal, 11:100143, 2025

Bradley Butcher, Michael O’Keefe, and James Titchener. Precise length control for large language models.Natural Language Processing Journal, 11:100143, 2025

work page 2025
[40]

Positionid: Llms can control lengths, copy and paste with explicit positional awareness

Noah Wang, Feiyu Duan, Yibo Zhang, Wangchunshu Zhou, Ke Xu, Wenhao Huang, and Jie Fu. Positionid: Llms can control lengths, copy and paste with explicit positional awareness. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16877–16915, 2024

work page 2024
[41]

Length desensitization in direct preference optimization.arXiv preprint arXiv:2409.06411, 2024

Wei Liu, Yang Bai, Chengcheng Han, Rongxiang Weng, Jun Xu, Xuezhi Cao, Jingang Wang, and Xunliang Cai. Length desensitization in direct preference optimization.arXiv preprint arXiv:2409.06411, 2024

work page arXiv 2024
[42]

Length-controlled margin-based preference optimization without reference model.arXiv preprint arXiv:2502.14643, 2025

Gengxu Li, Tingyu Xia, Yi Chang, and Yuan Wu. Length-controlled margin-based preference optimization without reference model.arXiv preprint arXiv:2502.14643, 2025

work page arXiv 2025
[43]

Laconic: Length-aware constrained reinforcement learning for llm.arXiv preprint arXiv:2602.14468, 2026

Chang Liu, Yiran Zhao, Lawrence Liu, Yaoqi Ye, Csaba Szepesvári, and Lin F Yang. Laconic: Length-aware constrained reinforcement learning for llm.arXiv preprint arXiv:2602.14468, 2026

work page arXiv 2026
[44]

Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models.arXiv preprint arXiv:2505.24864, 2025b

Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025

work page arXiv 2025
[45]

L1: Controlling how long a reasoning model thinks with reinforcement learning.ArXiv, abs/2503.04697, 2025

Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning.arXiv preprint arXiv:2503.04697, 2025

work page arXiv 2025
[46]

Timemarker: A versatile video-llm for long and short video understanding with superior temporal localization ability

Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, and Lin Ma. Timemarker: A versatile video-llm for long and short video understanding with superior temporal localization ability. arXiv preprint arXiv:2411.18211, 2024

work page arXiv 2024
[47]

Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, et al. Desta2. 5-audio: Toward general-purpose large audio language model with self-generated cross-modal alignment. arXiv preprint arXiv:2507.02768, 2025

work page arXiv 2025
[48]

On-policy rl meets off-policy experts: Harmonizing supervised fine- tuning and reinforcement learning via dynamic weighting, 2026

Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, and Jingren Zhou. On-policy rl meets off-policy experts: Harmonizing supervised fine- tuning and reinforcement learning via dynamic weighting, 2026. URL https://arxiv.org/ abs/2508.11408

work page arXiv 2026
[49]

Llama- omni: Seamless speech interaction with large language models

Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. Llama- omni: Seamless speech interaction with large language models. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[50]

URO-bench: Towards comprehensive evaluation for end-to-end spoken dialogue models

Ruiqi Yan, Xiquan Li, Wenxi Chen, Zhikang Niu, Chen Yang, Ziyang Ma, Kai Yu, and Xie Chen. URO-bench: Towards comprehensive evaluation for end-to-end spoken dialogue models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 17211–17242, S...

work page doi:10.18653/v1/2025.findings-emnlp.933 2025
[51]

Update to gpt-5 system card: Gpt-5.2

OpenAI. Update to gpt-5 system card: Gpt-5.2. Technical report, OpenAI, December 2025. URL https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/ oai_5_2_system-card.pdf. 14

work page 2025
[52]

Qwen2.5: A party of foundation models, September 2024

Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https:// qwenlm.github.io/blog/qwen2.5/

work page 2024
[53]

Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech

Siyi Zhou, Yiquan Zhou, Yi He, Xun Zhou, Jinchao Wang, Wei Deng, and Jingchen Shu. Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 35139–35148, 2026

work page 2026
[54]

Swift:a scal- able lightweight infrastructure for fine-tuning, 2024

Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, and Yingda Chen. Swift:a scal- able lightweight infrastructure for fine-tuning, 2024. URL https://arxiv.org/abs/2408. 05517

work page 2024
[55]

whisper-timestamped

Jérôme Louradour. whisper-timestamped. https://github.com/linto-ai/ whisper-timestamped, 2023

work page 2023
[56]

Generative spoken dialogue language modeling.Transactions of the Association for Computational Linguistics, 11:250–266, 2023

Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, et al. Generative spoken dialogue language modeling.Transactions of the Association for Computational Linguistics, 11:250–266, 2023

work page 2023
[57]

Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 15757–15773, 2023

work page 2023
[58]

Spirit-lm: Interleaved spoken and written language model.Transactions of the Association for Computational Linguistics, 13:30–52, 2025

Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R Costa-Jussa, Maha Elbayad, Sravya Popuri, Christophe Ropers, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, et al. Spirit-lm: Interleaved spoken and written language model.Transactions of the Association for Computational Linguistics, 13:30–52, 2025

work page 2025
[59]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[60]

Beyond turn-based interfaces: Synchronous llms as full-duplex dialogue agents

Bandhav Veluri, Benjamin N Peloquin, Bokai Yu, Hongyu Gong, and Shyamnath Gollakota. Beyond turn-based interfaces: Synchronous llms as full-duplex dialogue agents. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21390– 21402, 2024

work page 2024
[61]

Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities.arXiv preprint arXiv:2410.11190, 2024

Zhifei Xie and Changqiao Wu. Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities.arXiv preprint arXiv:2410.11190, 2024

work page arXiv 2024
[62]

Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm

Xiong Wang, Yangze Li, Chaoyou Fu, Yike Zhang, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, and Long Ma. Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm. InInternational Conference on Machine Learning, pages 63345–63354. PMLR, 2025

work page 2025
[63]

Glm-4-voice: Towards intelli- gent and human-like end-to-end spoken chatbot.arXiv preprint arXiv:2412.02612,

Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot.arXiv preprint arXiv:2412.02612, 2024

work page arXiv 2024
[64]

Slam-omni: Timbre-controllable voice interaction system with single-stage training

Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu, et al. Slam-omni: Timbre-controllable voice interaction system with single-stage training. InFindings of the Association for Computational Linguistics: ACL 2025, pages 2262–2282, 2025

work page 2025
[65]

Vita-1.5: Towards gpt-4o level real-time vision and speech interaction.arXiv preprint arXiv:2501.01957, 2025

Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction.arXiv preprint arXiv:2501.01957, 2025

work page arXiv 2025
[66]

Baichuan-audio: A unified framework for end-to-end speech interaction

Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, et al. Baichuan-audio: A unified framework for end-to-end speech interaction.arXiv preprint arXiv:2502.17239, 2025. 15

work page arXiv 2025
[67]

Kimi-Audio Technical Report

Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, et al. Kimi-audio technical report.arXiv preprint arXiv:2504.18425, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[68]

Llama-omni 2: Llm- based real-time spoken chatbot with autoregressive streaming speech synthesis

Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, and Yang Feng. Llama-omni 2: Llm- based real-time spoken chatbot with autoregressive streaming speech synthesis. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18617–18629, 2025

work page 2025
[69]

Step-audio 2 technical report, 2025

Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, et al. Step-audio 2 technical report.arXiv preprint arXiv:2507.16632, 2025

work page arXiv 2025
[70]

Can speech llms think while listening?arXiv preprint arXiv:2510.07497,

Yi-Jen Shih, Desh Raj, Chunyang Wu, Wei Zhou, SK Bong, Yashesh Gaur, Jay Mahadeokar, Ozlem Kalinli, and Mike Seltzer. Can speech llms think while listening?arXiv preprint arXiv:2510.07497, 2025

work page arXiv 2025
[71]

Chain-of-thought reason- ing in streaming full-duplex end-to-end spoken dialogue systems.arXiv preprint arXiv:2510.02066,

Siddhant Arora, Jinchuan Tian, Hayato Futami, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, and Shinji Watanabe. Chain-of-thought reasoning in streaming full-duplex end-to-end spoken dialogue systems.arXiv preprint arXiv:2510.02066, 2025

work page arXiv 2025
[72]

LFM2 technical report.arXiv:2511.23404, 2025

Alexander Amini, Anna Banaszak, Harold Benoit, Arthur Böök, Tarek Dakhran, Song Duong, Alfred Eng, Fernando Fernandes, Marc Härkönen, Anne Harrington, et al. Lfm2 technical report.arXiv preprint arXiv:2511.23404, 2025

work page arXiv 2025
[73]

Mimo-audio: Audio language models are few-shot learners

Dong Zhang, Gang Wang, Jinlong Xue, Kai Fang, Liang Zhao, Rui Ma, Shuhuai Ren, Shuo Liu, Tao Guo, Weiji Zhuang, et al. Mimo-audio: Audio language models are few-shot learners. arXiv preprint arXiv:2512.23808, 2025

work page arXiv 2025
[74]

Personaplex: V oice and role control for full duplex conversational speech models.arXiv preprint arXiv:2602.06053, 2026

Rajarshi Roy, Jonathan Raiman, Sang-gil Lee, Teodor-Dumitru Ene, Robert Kirby, Sungwon Kim, Jaehyeon Kim, and Bryan Catanzaro. Personaplex: V oice and role control for full duplex conversational speech models.arXiv preprint arXiv:2602.06053, 2026

work page arXiv 2026
[75]

Minicpm-o: A gemini 2.5 flash level mllm for vision, speech, and full-duplex multimodal live streaming on your phone

OpenBMB. Minicpm-o: A gemini 2.5 flash level mllm for vision, speech, and full-duplex multimodal live streaming on your phone. https://github.com/OpenBMB/MiniCPM-o,

work page
[76]

Audiolm: a language modeling approach to audio generation.IEEE/ACM transactions on audio, speech, and language processing, 31:2523–2533, 2023

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al. Audiolm: a language modeling approach to audio generation.IEEE/ACM transactions on audio, speech, and language processing, 31:2523–2533, 2023

work page 2023
[77]

Discrete audio tokens: More than a survey!arXiv preprint arXiv:2506.10274, 2025

Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, et al. Discrete audio tokens: More than a survey!arXiv preprint arXiv:2506.10274, 2025

work page arXiv 2025
[78]

Dual-tower

Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu. Speechtokenizer: Unified speech tokenizer for speech language models. InThe Twelfth International Conference on Learning Representations. 16 A Author Contributions All authors contributed significantly to the design of the method, benchmark construction, evaluation, and the writing and refinem...

work page 1989