pith. machine review for the scientific record.

arxiv: 2512.23578 · v3 · submitted 2025-12-29 · 💻 cs.CL · cs.SD

Recognition: 2 theorem links · Lean Theorem

Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 19:23 UTC · model grok-4.3

classification 💻 cs.CL cs.SD
keywords: spoken language models · style amnesia · multi-turn conversations · speaking style · paralinguistic features · style adherence · conversation consistency

The pith

Spoken language models lose instructed speaking styles after several conversation turns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that spoken language models instructed at the start of a conversation to use a specific style cannot keep that style consistent as the exchange continues. This style amnesia appears for paralinguistic features such as emotion, accent, volume, and speaking speed. It affects both proprietary and open-source models, and the models follow instructions less reliably when the style directive sits in a system message rather than a user message. Even when the models later recall the original instruction, they still do not express the style unless the recall is made explicit, at which point adherence improves.

Core claim

When spoken language models receive an initial instruction to speak in a given style, they exhibit style amnesia and stop maintaining that style after several turns of interaction. This holds across styles including emotion, accent, volume, and speaking speed. The models can retrieve the instruction when prompted later but fail to apply it unless the prompt explicitly recalls it; explicit recall therefore mitigates the amnesia. Adherence is weaker when the style instruction appears in system messages than when it appears in user messages.

What carries the argument

Style amnesia, the progressive loss of adherence to an initial speaking-style instruction over multiple turns in spoken language models.
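
To make the definition concrete, here is a minimal sketch of the measurement, assuming the per-turn instruction-following (IF) rate named in Figure 3 is the fraction of responses a judge marks as on-style; the relative-drop definition of the degradation rate D below is an assumption, since this excerpt does not give the paper's exact formula.

```python
# Minimal sketch: style amnesia as a declining per-turn IF rate.
# Assumption: degradation rate D is the relative drop from the
# first-turn rate IF1; the paper's exact definition may differ.

def if_rate(judgments: list[bool]) -> float:
    """Fraction of responses judged to match the instructed style."""
    return sum(judgments) / len(judgments)

def degradation_rate(per_turn: list[list[bool]]) -> float:
    """Assumed D: relative drop in IF rate from first to last turn."""
    if1, if_last = if_rate(per_turn[0]), if_rate(per_turn[-1])
    return (if1 - if_last) / if1 if if1 > 0 else 0.0

# Toy example: five dialogues judged over four assistant turns.
per_turn = [
    [True, True, True, True, False],      # turn 1: IF1 = 0.80
    [True, True, False, False, False],    # turn 2
    [True, False, False, False, False],   # turn 3
    [False, False, False, False, False],  # turn 4
]
print(f"IF1 = {if_rate(per_turn[0]):.2f}, D = {degradation_rate(per_turn):.2f}")
```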

If this is right

  • Explicitly prompting the model to recall the style instruction in later turns restores better style expression.
  • Style instructions placed in user messages are followed more consistently than the same instructions placed in system messages.
  • No model among the three proprietary and two open-source systems tested sustains the instructed style without mitigation.
  • Future spoken language models will need new mechanisms to preserve paralinguistic style across extended interactions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applications that rely on consistent character voices or emotional tone may see reduced user trust as styles shift mid-conversation.
  • The same degradation could affect other persistent conversation-level constraints beyond speaking style.
  • Adding a dedicated memory store for style parameters might reduce the need for repeated explicit reminders (see the sketch after this list).
  • Testing longer dialogues would reveal how rapidly style loss occurs and whether it stabilizes or worsens.
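
One way to picture the memory-store idea above, as a minimal sketch: a wrapper that stores the style directive once and re-injects it before every turn. This is an editorial illustration, not a mechanism from the paper; `slm.respond` is a hypothetical chat-style API.

```python
# Editorial sketch of a persistent style store. `slm.respond` is a
# hypothetical API taking a list of {"role", "content"} messages.

class StyleMemoryWrapper:
    def __init__(self, slm, style_instruction: str):
        self.slm = slm
        self.style = style_instruction  # stored once, reused every turn

    def respond(self, history: list[dict]) -> str:
        # The paper reports weaker adherence to system messages, so the
        # reminder is injected as a user message here.
        reminder = {"role": "user", "content": f"Reminder: {self.style}"}
        return self.slm.respond(history + [reminder])
```

This trades a small amount of context per turn for adherence, mirroring the paper's finding that explicit recall restores style expression.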

Load-bearing premise

The prompts and style judgments used in the evaluation isolate adherence to speaking style from other model limitations such as general instruction following or output quality.

What would settle it

Run a test in which models receive a style instruction at turn one, then generate responses for five or more turns without any reminder, and measure whether human or automatic judges detect the original style in the later outputs at rates above chance.
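
A minimal harness for that test might look like the sketch below. `slm.respond` and `judge.matches_style` are hypothetical stand-ins for a spoken language model API and a style judge; neither name comes from the paper's released code.

```python
# Sketch of the settling experiment: instruct a style once at turn one,
# continue with no reminders, and record per-turn style verdicts.

STYLE = "For the rest of this conversation, always speak in a sad tone."

def settling_run(slm, judge, user_turns: list[str]) -> list[bool]:
    history = [{"role": "user", "content": f"{STYLE} {user_turns[0]}"}]
    verdicts = []
    for i in range(len(user_turns)):
        reply = slm.respond(history)  # audio (or text + audio) response
        verdicts.append(judge.matches_style(reply, target="sad"))
        history.append({"role": "assistant", "content": reply})
        if i + 1 < len(user_turns):
            history.append({"role": "user", "content": user_turns[i + 1]})
    return verdicts  # later-turn rates are then compared against chance
```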

Figures

Figures reproduced from arXiv: 2512.23578 by Cheng-Han Chiang, Hung-yi Lee, Yu-Xiang Lin.

Figure 1: When instructed to consistently speak sadly [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2: The overview of the evaluation framework. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3: The first-turn IF rate IF1 and degradation rate D across different speaking styles. (Panels: Anger, Fast; axes: Assistant Turn 1–4 vs. IF Rate (%); models: Qwen2.5-Omni, Gemini Live, GPT-4o, GPT-4o mini, Step-Audio 2 mini.) [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4: Visualization of style amnesia. view at source ↗
Figure 5: The difference in first-turn IF rate IF1 when instructions are placed in system and user messages. (Accompanying judge-agreement table: Accent, Gemini-2.5 Pro κ 0.741 / MCC 0.747, Voxlect κ 0.809 / MCC 0.811; Emotion, Gemini-2.5 Pro κ 0.464 / MCC 0.487, Emotion2vec κ 0.476 / MCC 0.511.) [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6: The illustration of the recall process. view at source ↗
Figure 7: Prompt for dataset construction. [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8: Prompt for dialogue coherence evaluation. [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9: Prompt for recall evaluation. [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10: Instruction page. [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11: Annotation page. [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
read the original abstract

In this paper, we show that when spoken language models (SLMs) are instructed to speak in a specific speaking style at the beginning of a multi-turn conversation, they cannot maintain the required speaking styles after several turns of interaction; we refer to this as the style amnesia of SLMs. We focus on paralinguistic speaking styles, including emotion, accent, volume, and speaking speed. We evaluate three proprietary and two open-source SLMs, demonstrating that none of these models can maintain a consistent speaking style when instructed to do so. We further show that while SLMs can recall the style instruction when prompted in later turns, they still fail to express it, but through explicit recall can mitigate style amnesia. In addition, SLMs struggle more when the style instruction is placed in system messages rather than user messages, even though system messages are specifically designed to provide persistent, conversation-level instructions. Our findings highlight a systematic gap in current SLMs' ability to maintain speaking styles, highlighting the need for improved style adherence in future models. Our code and evaluation data are publicly available at https://github.com/YuXiangLin1234/SLM-Style-Amnesia.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that spoken language models (SLMs) suffer from 'style amnesia,' failing to maintain instructed paralinguistic speaking styles (emotion, accent, volume, speaking speed) over multi-turn conversations despite initial instructions. Evaluations across three proprietary and two open-source models show none maintain consistency; models recall instructions but fail to express them, with explicit recall as mitigation. Performance is worse when instructions are in system messages than user messages. Code and data are publicly released.

Significance. If the results hold, this identifies a systematic limitation in current SLMs for preserving paralinguistic style in interactive settings, relevant for natural spoken dialogue systems. The multi-model empirical evaluation with public code and data provides a reproducible benchmark that can guide improvements in style adherence.

major comments (2)
  1. [§4] §4 (Evaluation): The style adherence scores (e.g., Table 2) demonstrate degradation, but without ablations comparing style-instructed vs. neutral-instruction conditions or controls for overall coherence decline, it remains unclear whether the measurements isolate style-specific failure from general multi-turn instruction following or generation quality limitations.
  2. [§5.2] §5.2 (Mitigation): The explicit recall experiments show improved style expression, but the paper should report whether recall prompts affect secondary metrics such as response relevance or turn coherence to confirm the mitigation is style-targeted rather than a general prompt engineering effect.
minor comments (2)
  1. [Figures] Figures 1 and 2: Add explicit turn-number annotations and error bars to clarify degradation trends across models.
  2. [Abstract] Abstract: Specify the exact automatic metrics and human evaluation protocol used for style judgment to allow readers to assess reliability upfront.
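
On that second point, the figure captions already name the agreement statistics the paper uses for its judges: Cohen's Kappa and the Matthews correlation coefficient (MCC). A minimal sketch of computing both with scikit-learn, on made-up labels:

```python
# Judge-agreement statistics named in the paper's figures, computed
# with scikit-learn. The human/judge labels below are illustrative only.
from sklearn.metrics import cohen_kappa_score, matthews_corrcoef

human = ["sad", "sad", "neutral", "angry", "sad", "neutral"]
judge = ["sad", "neutral", "neutral", "angry", "sad", "sad"]

print("Cohen's Kappa:", cohen_kappa_score(human, judge))
print("MCC:", matthews_corrcoef(human, judge))
```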

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have made revisions to strengthen the evaluation and mitigation analyses as suggested.

read point-by-point responses
  1. Referee: [§4] §4 (Evaluation): The style adherence scores (e.g., Table 2) demonstrate degradation, but without ablations comparing style-instructed vs. neutral-instruction conditions or controls for overall coherence decline, it remains unclear whether the measurements isolate style-specific failure from general multi-turn instruction following or generation quality limitations.

    Authors: We agree that the current evaluation would benefit from explicit controls to isolate style-specific degradation. In the revised manuscript, we will add an ablation comparing style-instructed conversations to neutral-instruction baselines, reporting style adherence scores under both conditions. We will also include secondary metrics for overall coherence and response quality across turns to control for general multi-turn decline, ensuring the style amnesia effect is distinguished from broader instruction-following limitations. revision: yes

  2. Referee: [§5.2] §5.2 (Mitigation): The explicit recall experiments show improved style expression, but the paper should report whether recall prompts affect secondary metrics such as response relevance or turn coherence to confirm the mitigation is style-targeted rather than a general prompt engineering effect.

    Authors: We concur that reporting secondary metrics is necessary to confirm the mitigation is style-specific. In the revised version, we will include measurements of response relevance and turn coherence for the explicit recall condition alongside the style adherence improvements. This will demonstrate that the gains in style expression do not result from general prompt engineering effects that might degrade other aspects of conversation quality. revision: yes
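
The comparison structure the authors commit to in both responses can be summarized in one function; `run_conversation`, `score_style`, and `score_coherence` below are hypothetical helpers, not the authors' code.

```python
# Sketch of the promised ablation: the same dialogues under a
# style-instructed and a neutral-instruction condition, with coherence
# tracked as a control for general multi-turn decline.

def ablation_summary(dialogues, run_conversation, score_style, score_coherence):
    results = {}
    for condition in ("styled", "neutral"):
        convos = [run_conversation(d, condition=condition) for d in dialogues]
        results[condition] = {
            "style": sum(map(score_style, convos)) / len(convos),
            "coherence": sum(map(score_coherence, convos)) / len(convos),
        }
    return results  # a style gap with stable coherence isolates style amnesia
```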

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation of model behavior

full rationale

The paper conducts an empirical study by prompting SLMs with style instructions and measuring degradation via automatic and human judgments across turns. No equations, derivations, fitted parameters, or self-citations are used to derive the central claim; results follow directly from observed outputs of the evaluated models. The evaluation is grounded in external observables (the models' responses), with public code and data, satisfying the criteria for an independent empirical finding rather than one true by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on empirical observation rather than mathematical derivation; the main background assumption is that current SLMs are capable of interpreting style instructions at all.

axioms (1)
  • domain assumption: SLMs can interpret and attempt to follow paralinguistic style instructions given in prompts.
    Required for the experimental setup to be meaningful; invoked when models are instructed at the start of conversations.
invented entities (1)
  • style amnesia · no independent evidence
    purpose: Label for the observed degradation of speaking style over multi-turn interactions.
    New descriptive term introduced to characterize the empirical finding; no independent evidence outside the reported experiments.

pith-pipeline@v0.9.0 · 5513 in / 1169 out tokens · 62465 ms · 2026-05-16T19:23:35.289183+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 10 internal anchors

  1. [1]

    Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4):335--359

  2. [2]

    Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. 2014. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE transactions on affective computing, 5(4):377--390

  3. [3]

    Kai-Wei Chang, En-Pei Hu, Chun-Yi Kuan, Wenze Ren, Wei-Chih Chen, Guan-Ting Lin, Yu Tsao, Shao-Hua Sun, Hung-yi Lee, and James Glass. 2025. Game-time: Evaluating temporal dynamics in spoken language models. arXiv preprint arXiv:2509.26388

  4. [4]

    Xize Cheng, Ruofan Hu, Xiaoda Yang, Jingyu Lu, Dongjie Fu, Zehan Wang, Shengpeng Ji, Rongjie Huang, Boyang Zhang, Tao Jin, and 1 others. 2025. Voxdialogue: Can spoken dialogue systems understand information beyond words? In The Thirteenth International Conference on Learning Representations

  5. [5]

    Cheng-Han Chiang and Hung-yi Lee. 2023. https://doi.org/10.18653/v1/2023.acl-long.870 Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607--15631, Toronto, Canada. Association for Computational Linguistics

  6. [6]

    Cheng-Han Chiang, Xiaofei Wang, Chung-Ching Lin, Kevin Lin, Linjie Li, Radu Kopetz, Yao Qian, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, and Lijuan Wang. 2025. https://doi.org/10.18653/v1/2025.findings-emnlp.25 Audio-aware large language models as judges for speaking styles . In Findings of the Association for Computational Linguistics: EMNLP 2025, pages...

  7. [7]

    Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37--46

  8. [8]

    Gheorghe Comanici and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261

  9. [9]

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. 2024. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037

  10. [10]

    Yayue Deng, Guoqiang Hu, Haiyang Sun, Xiangyu Zhang, Haoyang Zhang, Fei Tian, Xuerui Yang, Gang Yu, and Eng Siong Chng. 2025. Multi-bench: A multi-turn interactive benchmark for assessing emotional intelligence ability of spoken dialogue models. arXiv preprint arXiv:2511.00850

  11. [11]

    Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. 2025. Llama-omni: Seamless speech interaction with large language models. In The Thirteenth International Conference on Learning Representations

  12. [12]

Tiantian Feng, Kevin Huang, Anfeng Xu, Xuan Shi, Thanathai Lertpetchpun, Jihwan Lee, Yoonjeong Lee, Dani Byrd, and Shrikanth Narayanan. 2025a. Voxlect: A speech foundation model benchmark for modeling dialects and regional languages around the globe. arXiv preprint arXiv:2508.01691

  13. [13]

Tiantian Feng, Jihwan Lee, Anfeng Xu, Yoonjeong Lee, Thanathai Lertpetchpun, Xuan Shi, Helin Wang, Thomas Thebaud, Laureano Moro-Velazquez, Dani Byrd, and 1 others. 2025b. Vox-profile: A speech foundation model benchmark for characterizing diverse speaker and speech traits. arXiv preprint arXiv:2505.14648

  14. [14]

    Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5):378

  15. [15]

Google. 2025a. Advanced audio dialog and generation with Gemini 2.5. https://blog.google/technology/google-deepmind/gemini-2-5-native-audio/

  16. [16]

Google. 2025b. Gemini Live: A more helpful, natural and visual assistant. https://blog.google/products/gemini/gemini-live-updates-august-2025/

  17. [17]

    Chi Han, Xin Liu, Haodong Wang, Shiyang Li, Jingfeng Yang, Haoming Jiang, Zhengyang Wang, Qingyu Yin, Liang Qiu, Changlong Yu, Yifan Gao, Zheng Li, Bing Yin, Jingbo Shang, and Heng Ji. 2025. https://doi.org/10.18653/v1/2025.findings-emnlp.1387 Can language models follow multiple turns of entangled instructions? In Findings of the Association for Computati...

  18. [18]

    Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, and 1 others. 2025. Step-audio: Unified understanding and generation in intelligent speech interaction. arXiv preprint arXiv:2502.11946

  19. [19]

    Heeseung Kim, Che Hyun Lee, Sangkwon Park, Jiheum Yeom, Nohil Park, Sangwon Yu, and Sungroh Yoon. 2025. https://doi.org/10.18653/v1/2025.findings-acl.470 Does your voice assistant remember? analyzing conversational context recall and utilization in voice interaction models . In Findings of the Association for Computational Linguistics: ACL 2025, pages 898...

  20. [20]

    Hyunwoo Kim, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae Yu, Pei Zhou, Ronan Bras, Malihe Alikhani, Gunhee Kim, and 1 others. 2023. Soda: Million-scale dialogue distillation with social commonsense contextualization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12930--12949

  21. [21]

    Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. In Advances in neural information processing systems, volume 33, pages 17022--17033

  22. [22]

    Klaus Krippendorff. 1980. Content analysis: An introduction to its methodology. Sage publications

  23. [23]

Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1124 MT-Eval: A multi-turn capabilities evaluation benchmark for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages...

  24. [24]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

  25. [25]

    Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. 2025. Llms get lost in multi-turn conversation. arXiv preprint arXiv:2505.06120

  26. [26]

Jinnan Li, Jinzhe Li, Yue Wang, Yi Chang, and Yuan Wu. 2025. https://doi.org/10.18653/v1/2025.findings-acl.486 StructFlowBench: A structured flow benchmark for multi-turn instruction following. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9322--9341, Vienna, Austria. Association for Computational Linguistics

  27. [27]

    Yu-Xiang Lin, Chih-Kai Yang, Wei-Chih Chen, Chen-An Li, Chien-yu Huang, Xuanjun Chen, and Hung-yi Lee. 2025. A preliminary exploration with gpt-4o voice mode. arXiv preprint arXiv:2502.09940

  28. [28]

    Heyang Liu, Yuhao Wang, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, and Yu Wang. 2025. Vocalbench: Benchmarking the vocal conversational abilities for speech interaction models. arXiv preprint arXiv:2505.15727

  29. [29]

    Chengqian Ma, Wei Tao, and Steven Y Guo. 2025. C3: A bilingual benchmark for spoken dialogue models exploring challenges in complex conversations. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 22789--22807

  30. [30]

    Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen. 2024. emotion2vec: Self-supervised pre-training for speech emotion representation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 15747--15760

  31. [31]

    Brian W Matthews. 1975. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 405(2):442--451

  32. [32]

    Anna Neumann, Elisabeth Kirsten, Muhammad Bilal Zafar, and Jatinder Singh. 2025. https://doi.org/10.1145/3715275.3732038 Position is power: System prompts as a mechanism of bias in large language models (llms) . In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, FAccT '25, page 573–598, New York, NY, USA. Association ...

  33. [33]

Nvidia. 2025. NVIDIA Speech AI Models Deliver Industry-Leading Accuracy and Performance. https://developer.nvidia.com/blog/nvidia-speech-ai-models-deliver-industry-leading-accuracy-and-performance/

  34. [34]

OpenAI. 2024. GPT-4o mini: Advancing Cost-Efficient Intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

  35. [35]

    OpenAI. 2024. https://arxiv.org/abs/2410.21276 Gpt-4o system card . Preprint, arXiv:2410.21276

  36. [36]

OpenAI. 2025. Introducing GPT-5. https://openai.com/index/introducing-gpt-5/

  37. [37]

    Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2019. Meld: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 527--536

  38. [38]

    Ayesha Qamar, Jonathan Tong, and Ruihong Huang. 2025. https://doi.org/10.18653/v1/2025.acl-long.1271 Do LLM s understand dialogues? a case study on dialogue acts . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26219--26237, Vienna, Austria. Association for Computational Linguistics

  39. [39]

    Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. 2022. Utmos: Utokyo-sarulab system for voicemos challenge 2022. In Proc. Interspeech 2022, pages 4521--4525

  40. [40]

    Shuzheng Si, Wentao Ma, Haoyu Gao, Yuchuan Wu, Ting-En Lin, Yinpei Dai, Hangyu Li, Rui Yan, Fei Huang, and Yongbin Li. 2023. Spokenwoz: A large-scale speech-text benchmark for spoken task-oriented dialogue agents. Advances in Neural Information Processing Systems, 36:39088--39118

  41. [41]

Christian J. Steinmetz and Joshua D. Reiss. 2021. pyloudnorm: A simple yet flexible loudness meter in python. In 150th AES Convention

  42. [42]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

  43. [43]

    Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. 2024. The instruction hierarchy: Training llms to prioritize privileged instructions. arXiv preprint arXiv:2404.13208

  44. [44]

    Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, and 1 others. 2025. Step-audio 2 technical report. arXiv preprint arXiv:2507.16632

  45. [45]

Jin Xu and 1 others. 2025. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215

  46. [46]

    Ruiqi Yan, Xiquan Li, Wenxi Chen, Zhikang Niu, Chen Yang, Ziyang Ma, Kai Yu, and Xie Chen. 2025. https://doi.org/10.18653/v1/2025.findings-emnlp.933 URO -bench: Towards comprehensive evaluation for end-to-end spoken dialogue models . In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 17211--17242, Suzhou, China. Association fo...

  47. [47]

    Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. 2024. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612

  48. [48]

    Jun Zhan, Mingyang Han, Yuxuan Xie, Chen Wang, Dong Zhang, Kexin Huang, Haoxiang Shi, DongXiao Wang, Tengtao Song, Qinyuan Cheng, and 1 others. 2025. Vstyle: A benchmark for voice style adaptation with spoken instructions. arXiv preprint arXiv:2509.09716
