Recognition: 2 theorem links
· Lean TheoremTiCo: Time-Controllable Spoken Dialogue Model
Pith reviewed 2026-05-15 00:35 UTC · model grok-4.3
The pith
TiCo lets spoken dialogue models follow time constraints such as 'generate a 15-second response' by inserting markers that track elapsed speaking time during generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TiCo post-trains a spoken dialogue model to insert Spoken Time Markers such as <10.6 seconds> during autoregressive generation; these markers supply explicit elapsed-time information that lets the model estimate remaining time and modulate content length to satisfy an instruction-specified target duration, all without paired training data and using only reinforcement learning on self-generated trajectories scored by a verifiable duration reward.
What carries the argument
Spoken Time Markers (STM) are special tokens inserted at each generation step that encode the cumulative speaking time so far, allowing the model to maintain an internal clock and adjust token choices to meet a target total duration.
If this is right
- Duration error drops by a factor of 2.7 relative to the original backbone model.
- Duration error drops by a factor of 1.6 relative to the strongest prior baseline.
- Response quality metrics remain statistically unchanged after the post-training stage.
- The method requires no paired question-answer data, relying only on self-generated trajectories and a verifiable reward.
- TiCo-Bench provides a standardized set of time-constrained instructions for evaluating future spoken dialogue models.
Where Pith is reading between the lines
- The same marker-plus-RL pattern could be applied to other verifiable constraints such as emotional tone or speaking rate without paired supervision.
- In deployed voice assistants, accurate duration control may reduce user interruptions and improve perceived turn-taking naturalness.
- Because training data are self-generated, the approach scales to new domains or languages as long as a duration verifier can be defined.
Load-bearing premise
That Spoken Time Markers inserted during generation give the model enough accurate time information to steer total duration without harming speech naturalness.
What would settle it
Measure actual audio durations of responses generated under explicit time targets and check whether the error remains at least 2.7 times smaller than the backbone model across a range of target lengths.
Figures
read the original abstract
We introduce TiCo, a time-controllable spoken dialogue model (SDM) that follows time-constrained instructions (e.g., "Please generate a response lasting about 15 seconds") and generates spoken responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions. To systematically evaluate this, we introduce TiCo-Bench, the first benchmark for time-controllable instruction following in SDMs, on which existing open-source and commercial models frequently fail to satisfy explicit time constraints. TiCo addresses this limitation by enabling an SDM to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is post-trained efficiently without question-answer paired data, relying on self-generation and reinforcement learning with verifiable reward. Experimental results show that TiCo reduces duration error by 2.7x over its backbone and 1.6x over the strongest baseline, while preserving response quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TiCo, a spoken dialogue model that enables control over response duration using Spoken Time Markers (STM) inserted during generation to track elapsed time and adjust content accordingly. It is trained via self-generation and reinforcement learning with a verifiable duration reward, without requiring paired question-answer data. The model is evaluated on the new TiCo-Bench benchmark, where it reportedly reduces duration error by 2.7 times compared to its backbone and 1.6 times over the strongest baseline, while preserving response quality.
Significance. If the central mechanism holds, the work offers a practical advance for spoken dialogue systems by addressing duration control, a key factor in user experience for voice assistants. The post-training strategy relying on self-generation and RL without paired data is efficient and avoids costly data collection. The introduction of TiCo-Bench also provides a new evaluation resource for time-controllable instruction following.
major comments (2)
- [Abstract] Abstract: the central quantitative claims (2.7x duration error reduction over backbone, 1.6x over strongest baseline) are presented without any description of the experimental setup, exact baselines, test set size in TiCo-Bench, statistical significance testing, or error bars, preventing verification of the reported gains.
- [Method] Method section on reinforcement learning: the duration reward is computed solely from the completed utterance's measured duration versus the target; this leaves open the possibility that optimization succeeds via superficial length matching (e.g., implicit rate changes or fillers) rather than genuine on-the-fly STM-based time estimation and content adjustment during autoregressive generation. An ablation removing STM access or inspecting intermediate marker predictions is required to substantiate the claimed mechanism.
minor comments (1)
- [Abstract] Abstract: the example STM notation (<10.6 seconds>) should specify whether these are added as special tokens to the vocabulary and how they are tokenized during training and inference.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which have helped us strengthen the presentation of our work. We address each major comment below and have revised the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central quantitative claims (2.7x duration error reduction over backbone, 1.6x over strongest baseline) are presented without any description of the experimental setup, exact baselines, test set size in TiCo-Bench, statistical significance testing, or error bars, preventing verification of the reported gains.
Authors: We agree that the abstract would benefit from additional context on the experimental setup to improve verifiability. In the revised manuscript, we have expanded the abstract to briefly note the TiCo-Bench test set size (500 instructions), the main baselines (including the strongest open-source and commercial models), that results are reported as averages over 3 random seeds with standard error bars, and that improvements are statistically significant (p<0.01 via paired t-test). Full experimental details, including exact setup and significance testing, remain in Section 4 and the appendix as before. revision: yes
-
Referee: [Method] Method section on reinforcement learning: the duration reward is computed solely from the completed utterance's measured duration versus the target; this leaves open the possibility that optimization succeeds via superficial length matching (e.g., implicit rate changes or fillers) rather than genuine on-the-fly STM-based time estimation and content adjustment during autoregressive generation. An ablation removing STM access or inspecting intermediate marker predictions is required to substantiate the claimed mechanism.
Authors: We appreciate this insightful concern regarding the underlying mechanism. To directly address it, we have added a new ablation study in the revised manuscript (Section 4.3 and Appendix C) where STM tokens are masked during autoregressive generation while keeping the RL training otherwise identical. The ablation shows a 2.1x increase in duration error compared to the full TiCo model, confirming that performance relies on STM-based time tracking rather than superficial adjustments. We have also included an analysis of intermediate STM predictions (e.g., accuracy of elapsed-time markers at generation steps 10, 20, and 30), which correlate strongly with final duration accuracy. These results substantiate the on-the-fly estimation claim. revision: yes
Circularity Check
No circularity: empirical RL result with external verifiable reward
full rationale
The paper introduces STM as an architectural addition and trains via self-generation + RL using a duration reward computed from measured output length versus target. No equations, self-citations, or derivations are shown that define the target controllability in terms of the measured improvement or reduce the 2.7x error reduction to a fitted parameter by construction. The reward is externally verifiable (final duration) and independent of the model's internal STM usage during generation, so the claimed gain does not collapse to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Reinforcement learning with a verifiable duration reward can train time control effectively from self-generated data alone
invented entities (1)
-
Spoken Time Markers (STM)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Spoken Time Markers (e.g., <10.6 seconds>) ... RLVR with verifiable reward ... R(g)_main = F(t_inst - t_last) where F is Gaussian
-
IndisputableMonolith/Foundation/ArrowOfTime.leanarrow_from_z unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
two-stage framework ... self-generation ... GRPO + CHORD regularization
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- a...
Reference graph
Works this paper leans on
-
[1]
Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-yi Lee, Karen Livescu, and Shinji Watanabe. On the landscape of spoken language models: A comprehensive survey.Transactions on Machine Learning Research
-
[2]
Recent advances in speech language models: A survey
Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Steven Y Guo, and Irwin King. Recent advances in speech language models: A survey. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13943–13970, 2025
work page 2025
-
[3]
Wavchat: A survey of spoken dialogue models
Shengpeng Ji, Yifu Chen, Minghui Fang, Jialong Zuo, Jingyu Lu, Hanting Wang, Ziyue Jiang, Long Zhou, Shujie Liu, Xize Cheng, et al. Wavchat: A survey of spoken dialogue models. arXiv preprint arXiv:2411.13577, 2024
-
[4]
Allan de Barcelos Silva, Marcio Miguel Gomes, Cristiano André Da Costa, Rodrigo da Rosa Righi, Jorge Luis Victoria Barbosa, Gustavo Pessin, Geert De Doncker, and Gus- tavo Federizzi. Intelligent personal assistants: A systematic literature review.Expert systems with applications, 147:113193, 2020. 11
work page 2020
-
[5]
How generative ai voice agents will transform medicine.npj Digital Medicine, 8(1):353, 2025
Scott J Adams, Julián N Acosta, and Pranav Rajpurkar. How generative ai voice agents will transform medicine.npj Digital Medicine, 8(1):353, 2025
work page 2025
-
[6]
Wei Zhang, Zhenhong Zhou, Kun Wang, Junfeng Fang, Yuanhe Zhang, Rui Wang, Ge Zhang, Xavier Li, Li Sun, Lingjuan Lyu, et al. Lifebench: Evaluating length instruction following in large language models.arXiv preprint arXiv:2505.16234, 2025
-
[7]
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
Explaining length bias in LLM-based preference evaluations
Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Tianfu Wang, Zhengyu Chen, Nicholas Jing Yuan, Jianxun Lian, Kaize Ding, and Hui Xiong. Explaining length bias in LLM-based preference evaluations. InFindings of the Association for Computational Linguis- tics: EMNLP 2025, pages 6763–6794, November 2025
work page 2025
-
[9]
Juncheng Xie and Hung-yi Lee. Prompt-based one-shot exact length-controlled generation with llms.arXiv preprint arXiv:2508.13805, 2025
-
[10]
Renlong Jie, Xiaojun Meng, Lifeng Shang, Xin Jiang, and Qun Liu. Prompt-based length controlled generation with reinforcement learning.arXiv preprint arXiv:2308.12030, 2023
-
[11]
Hansel: Output length controlling framework for large language models
Seoha Song, Junhyun Lee, and Hyeonmok Ko. Hansel: Output length controlling framework for large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25146–25154, 2025
work page 2025
-
[12]
Dennis H Klatt. Linguistic uses of segmental duration in english: Acoustic and perceptual evidence.The journal of the acoustical society of America, 59(5):1208–1221, 1976
work page 1976
-
[13]
Explaining phonetic variation: A sketch of the h&h theory
Björn Lindblom. Explaining phonetic variation: A sketch of the h&h theory. InSpeech production and speech modelling, pages 403–439. Springer, 1990
work page 1990
-
[14]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Challenges for spoken dialogue systems
James Glass. Challenges for spoken dialogue systems. InProceedings of the 1999 IEEE ASRU Workshop, volume 696. MIT Laboratory for Computer Science Cambridge, 1999
work page 1999
-
[16]
Speech resynthesis from discrete disentangled self-supervised representations
Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. Speech resynthesis from discrete disentangled self-supervised representations. InProc. Interspeech 2021, pages 3615–3619, 2021
work page 2021
-
[17]
Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, et al. Textually pretrained speech language models.Advances in Neural Information Processing Systems, 36:63483–63501, 2023
work page 2023
-
[18]
Kai-Wei Chang, Wei-Cheng Tseng, Shang-Wen Li, and Hung-yi Lee. Speechprompt: An exploration of prompt tuning on generative spoken language model for speech processing tasks. arXiv preprint arXiv:2203.16773, 2022
-
[19]
Kai-Wei Chang, Haibin Wu, Yu-Kai Wang, Yuan-Kuei Wu, Hua Shen, Wei-Cheng Tseng, Iu-thing Kang, Shang-Wen Li, and Hung-yi Lee. Speechprompt: Prompting speech language models for speech processing tasks.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:3730–3744, 2024
work page 2024
-
[20]
Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, and Ann Lee. Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation. 2022
work page 2022
-
[21]
STITCH: Simulta- neous thinking and talking with chunked reasoning for spoken language models
Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie LIU, Zhendong Wang, Zhengyuan Yang, Hung yi Lee, and Lijuan Wang. STITCH: Simulta- neous thinking and talking with chunked reasoning for spoken language models. InThe Fourteenth International Conference on Learning Representations, 2026. URL https: //openreview.net/forum?id=5Z1e...
work page 2026
-
[22]
Siddhant Arora, Haidar Khan, Kai Sun, Xin Luna Dong, Sajal Choudhary, Seungwhan Moon, Xinyuan Zhang, Adithya Sagar, Surya Teja Appini, Kaushik Patnaik, et al. Stream rag: Instant and accurate spoken dialogue systems with streaming tool usage.arXiv preprint arXiv:2510.02044, 2025
-
[23]
Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-omni technical report, 2025. URLhttps://arxiv.org/abs/2503.20215
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[24]
Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
Recent advances in discrete speech tokens: A review
Yiwei Guo, Zhihan Li, Hankun Wang, Bohan Li, Chongtian Shao, Hanglei Zhang, Chenpeng Du, Xie Chen, Shujie Liu, and Kai Yu. Recent advances in discrete speech tokens: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
work page 2025
-
[26]
Codec-superb: An in-depth analysis of sound codec models
Haibin Wu, Ho-Lam Chung, Yi-Cheng Lin, Yuan-Kuei Wu, Xuanjun Chen, Yu-Chi Pai, Hsiu- Hsuan Wang, Kai-Wei Chang, Alex Liu, and Hung-yi Lee. Codec-superb: An in-depth analysis of sound codec models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 10330–10348, 2024
work page 2024
-
[27]
Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H Liu, and Hung-yi Lee. Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities.arXiv preprint arXiv:2503.04721, 2025
-
[28]
Guan-Ting Lin, Shih-Yun Shan Kuan, Jiatong Shi, Kai-Wei Chang, Siddhant Arora, Shinji Watanabe, and Hung-yi Lee. Full-duplex-bench-v2: A multi-turn evaluation framework for duplex dialogue systems with an automated examiner.arXiv preprint arXiv:2510.07838, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
Shu-wen Yang, Ming Tu, Andy T Liu, Xinghua Qu, Hung-yi Lee, Lu Lu, Yuxuan Wang, and Yonghui Wu. Paras2s: Benchmarking and aligning spoken language models for paralinguistic- aware speech-to-speech interaction.arXiv preprint arXiv:2511.08723, 2025
-
[30]
F-Actor: Controllable Conversational Behaviour in Full-Duplex Models
Maike Züfle, Ondrej Klejch, Nicholas Sanders, Jan Niehues, Alexandra Birch, and Tsz Kin Lam. F-actor: Controllable conversational behaviour in full-duplex models.arXiv preprint arXiv:2601.11329, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[31]
Game-Time: Evaluating Temporal Dynamics in Spoken Language Models
Kai-Wei Chang, En-Pei Hu, Chun-Yi Kuan, Wenze Ren, Wei-Chih Chen, Guan-Ting Lin, Yu Tsao, Shao-Hua Sun, Hung-yi Lee, and James Glass. Game-time: Evaluating temporal dynamics in spoken language models.arXiv preprint arXiv:2509.26388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
Fastspeech 2: Fast and high-quality end-to-end text to speech
Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech 2: Fast and high-quality end-to-end text to speech. InInternational Conference on Learning Representations, 2021
work page 2021
-
[33]
Towards controllable speech synthesis in the era of large language models: A systematic survey
Tianxin Xie, Yan Rong, Pengfei Zhang, Wenwu Wang, and Li Liu. Towards controllable speech synthesis in the era of large language models: A systematic survey. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 764–791, 2025
work page 2025
-
[34]
Enhancing temporal understanding in audio question answering for large audio language models
Arvind Krishna Sridhar, Yinyi Guo, and Erik Visser. Enhancing temporal understanding in audio question answering for large audio language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track), pages 1026–1035, 2025
work page 2025
-
[35]
Hualei Wang, Yiming Li, Shuo Ma, Hong Liu, and Xiangdong Wang. Listening be- tween the frames: Bridging temporal gaps in large audio-language models.arXiv preprint arXiv:2511.11039, 2025
-
[36]
Length controlled generation for black-box llms
Yuxuan Gu, Wenjie Wang, Xiaocheng Feng, Weihong Zhong, Kun Zhu, Lei Huang, Ting Liu, Bing Qin, and Tat-Seng Chua. Length controlled generation for black-box llms. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16878–16895, 2025. 13
work page 2025
-
[37]
Zero-shot strategies for length-controllable summarization
Fabian Retkowski and Alex Waibel. Zero-shot strategies for length-controllable summarization. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 551–572, 2025
work page 2025
-
[38]
Controlling summariza- tion length through eos token weighting.arXiv preprint arXiv:2506.05017, 2025
Zeno Belligoli, Emmanouil Stergiadis, Eran Fainman, and Ilya Gusev. Controlling summariza- tion length through eos token weighting.arXiv preprint arXiv:2506.05017, 2025
-
[39]
Bradley Butcher, Michael O’Keefe, and James Titchener. Precise length control for large language models.Natural Language Processing Journal, 11:100143, 2025
work page 2025
-
[40]
Positionid: Llms can control lengths, copy and paste with explicit positional awareness
Noah Wang, Feiyu Duan, Yibo Zhang, Wangchunshu Zhou, Ke Xu, Wenhao Huang, and Jie Fu. Positionid: Llms can control lengths, copy and paste with explicit positional awareness. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16877–16915, 2024
work page 2024
-
[41]
Length desensitization in direct preference optimization.arXiv preprint arXiv:2409.06411, 2024
Wei Liu, Yang Bai, Chengcheng Han, Rongxiang Weng, Jun Xu, Xuezhi Cao, Jingang Wang, and Xunliang Cai. Length desensitization in direct preference optimization.arXiv preprint arXiv:2409.06411, 2024
-
[42]
Gengxu Li, Tingyu Xia, Yi Chang, and Yuan Wu. Length-controlled margin-based preference optimization without reference model.arXiv preprint arXiv:2502.14643, 2025
-
[43]
Chang Liu, Yiran Zhao, Lawrence Liu, Yaoqi Ye, Csaba Szepesvári, and Lin F Yang. Laconic: Length-aware constrained reinforcement learning for llm.arXiv preprint arXiv:2602.14468, 2026
-
[44]
Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025
-
[45]
Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning.arXiv preprint arXiv:2503.04697, 2025
-
[46]
Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, and Lin Ma. Timemarker: A versatile video-llm for long and short video understanding with superior temporal localization ability. arXiv preprint arXiv:2411.18211, 2024
-
[47]
Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, et al. Desta2. 5-audio: Toward general-purpose large audio language model with self-generated cross-modal alignment. arXiv preprint arXiv:2507.02768, 2025
-
[48]
Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, and Jingren Zhou. On-policy rl meets off-policy experts: Harmonizing supervised fine- tuning and reinforcement learning via dynamic weighting, 2026. URL https://arxiv.org/ abs/2508.11408
-
[49]
Llama- omni: Seamless speech interaction with large language models
Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. Llama- omni: Seamless speech interaction with large language models. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[50]
URO-bench: Towards comprehensive evaluation for end-to-end spoken dialogue models
Ruiqi Yan, Xiquan Li, Wenxi Chen, Zhikang Niu, Chen Yang, Ziyang Ma, Kai Yu, and Xie Chen. URO-bench: Towards comprehensive evaluation for end-to-end spoken dialogue models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 17211–17242, S...
-
[51]
Update to gpt-5 system card: Gpt-5.2
OpenAI. Update to gpt-5 system card: Gpt-5.2. Technical report, OpenAI, December 2025. URL https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/ oai_5_2_system-card.pdf. 14
work page 2025
-
[52]
Qwen2.5: A party of foundation models, September 2024
Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https:// qwenlm.github.io/blog/qwen2.5/
work page 2024
-
[53]
Siyi Zhou, Yiquan Zhou, Yi He, Xun Zhou, Jinchao Wang, Wei Deng, and Jingchen Shu. Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 35139–35148, 2026
work page 2026
-
[54]
Swift:a scal- able lightweight infrastructure for fine-tuning, 2024
Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, and Yingda Chen. Swift:a scal- able lightweight infrastructure for fine-tuning, 2024. URL https://arxiv.org/abs/2408. 05517
work page 2024
-
[55]
Jérôme Louradour. whisper-timestamped. https://github.com/linto-ai/ whisper-timestamped, 2023
work page 2023
-
[56]
Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, et al. Generative spoken dialogue language modeling.Transactions of the Association for Computational Linguistics, 11:250–266, 2023
work page 2023
-
[57]
Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities
Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 15757–15773, 2023
work page 2023
-
[58]
Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R Costa-Jussa, Maha Elbayad, Sravya Popuri, Christophe Ropers, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, et al. Spirit-lm: Interleaved spoken and written language model.Transactions of the Association for Computational Linguistics, 13:30–52, 2025
work page 2025
-
[59]
Moshi: a speech-text foundation model for real-time dialogue
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue.arXiv preprint arXiv:2410.00037, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[60]
Beyond turn-based interfaces: Synchronous llms as full-duplex dialogue agents
Bandhav Veluri, Benjamin N Peloquin, Bokai Yu, Hongyu Gong, and Shyamnath Gollakota. Beyond turn-based interfaces: Synchronous llms as full-duplex dialogue agents. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21390– 21402, 2024
work page 2024
-
[61]
Zhifei Xie and Changqiao Wu. Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities.arXiv preprint arXiv:2410.11190, 2024
-
[62]
Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm
Xiong Wang, Yangze Li, Chaoyou Fu, Yike Zhang, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, and Long Ma. Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm. InInternational Conference on Machine Learning, pages 63345–63354. PMLR, 2025
work page 2025
-
[63]
Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot.arXiv preprint arXiv:2412.02612, 2024
-
[64]
Slam-omni: Timbre-controllable voice interaction system with single-stage training
Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu, et al. Slam-omni: Timbre-controllable voice interaction system with single-stage training. InFindings of the Association for Computational Linguistics: ACL 2025, pages 2262–2282, 2025
work page 2025
-
[65]
Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction.arXiv preprint arXiv:2501.01957, 2025
-
[66]
Baichuan-audio: A unified framework for end-to-end speech interaction
Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, et al. Baichuan-audio: A unified framework for end-to-end speech interaction.arXiv preprint arXiv:2502.17239, 2025. 15
-
[67]
Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, et al. Kimi-audio technical report.arXiv preprint arXiv:2504.18425, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[68]
Llama-omni 2: Llm- based real-time spoken chatbot with autoregressive streaming speech synthesis
Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, and Yang Feng. Llama-omni 2: Llm- based real-time spoken chatbot with autoregressive streaming speech synthesis. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18617–18629, 2025
work page 2025
-
[69]
Step-audio 2 technical report, 2025
Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, et al. Step-audio 2 technical report.arXiv preprint arXiv:2507.16632, 2025
-
[70]
Can speech llms think while listening?arXiv preprint arXiv:2510.07497,
Yi-Jen Shih, Desh Raj, Chunyang Wu, Wei Zhou, SK Bong, Yashesh Gaur, Jay Mahadeokar, Ozlem Kalinli, and Mike Seltzer. Can speech llms think while listening?arXiv preprint arXiv:2510.07497, 2025
-
[71]
Siddhant Arora, Jinchuan Tian, Hayato Futami, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, and Shinji Watanabe. Chain-of-thought reasoning in streaming full-duplex end-to-end spoken dialogue systems.arXiv preprint arXiv:2510.02066, 2025
-
[72]
LFM2 technical report.arXiv:2511.23404, 2025
Alexander Amini, Anna Banaszak, Harold Benoit, Arthur Böök, Tarek Dakhran, Song Duong, Alfred Eng, Fernando Fernandes, Marc Härkönen, Anne Harrington, et al. Lfm2 technical report.arXiv preprint arXiv:2511.23404, 2025
-
[73]
Mimo-audio: Audio language models are few-shot learners
Dong Zhang, Gang Wang, Jinlong Xue, Kai Fang, Liang Zhao, Rui Ma, Shuhuai Ren, Shuo Liu, Tao Guo, Weiji Zhuang, et al. Mimo-audio: Audio language models are few-shot learners. arXiv preprint arXiv:2512.23808, 2025
-
[74]
Rajarshi Roy, Jonathan Raiman, Sang-gil Lee, Teodor-Dumitru Ene, Robert Kirby, Sungwon Kim, Jaehyeon Kim, and Bryan Catanzaro. Personaplex: V oice and role control for full duplex conversational speech models.arXiv preprint arXiv:2602.06053, 2026
-
[75]
OpenBMB. Minicpm-o: A gemini 2.5 flash level mllm for vision, speech, and full-duplex multimodal live streaming on your phone. https://github.com/OpenBMB/MiniCPM-o,
-
[76]
Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al. Audiolm: a language modeling approach to audio generation.IEEE/ACM transactions on audio, speech, and language processing, 31:2523–2533, 2023
work page 2023
-
[77]
Discrete audio tokens: More than a survey!arXiv preprint arXiv:2506.10274, 2025
Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, et al. Discrete audio tokens: More than a survey!arXiv preprint arXiv:2506.10274, 2025
-
[78]
Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu. Speechtokenizer: Unified speech tokenizer for speech language models. InThe Twelfth International Conference on Learning Representations. 16 A Author Contributions All authors contributed significantly to the design of the method, benchmark construction, evaluation, and the writing and refinem...
work page 1989
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.