Recognition: 2 Lean theorem links
Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models
Pith reviewed 2026-05-16 19:23 UTC · model grok-4.3
The pith
Spoken language models lose instructed speaking styles after several conversation turns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When spoken language models receive an initial instruction to speak in a given style, they exhibit style amnesia: they stop maintaining that style after several turns of interaction. This holds across styles including emotion, accent, volume, and speaking speed. The models can recall the instruction when prompted in later turns yet still fail to express it; prompting them to explicitly recall the style, however, mitigates the amnesia. Adherence is weaker when the style instruction appears in system messages than when it appears in user messages.
What carries the argument
Style amnesia, the progressive loss of adherence to an initial speaking-style instruction over multiple turns in spoken language models.
If this is right
- Explicitly prompting the model to recall the style instruction in later turns restores better style expression.
- Style instructions placed in user messages are followed more consistently than the same instructions placed in system messages.
- No model among the three proprietary and two open-source systems tested sustains the instructed style without mitigation.
- Future spoken language models require new mechanisms to preserve paralinguistic style across extended interactions.
Where Pith is reading between the lines
- Applications that rely on consistent character voices or emotional tone may see reduced user trust as styles shift mid-conversation.
- The same degradation could affect other persistent conversation-level constraints beyond speaking style.
- Adding a dedicated memory store for style parameters might reduce the need for repeated explicit reminders.
- Testing longer dialogues would reveal how rapidly style loss occurs and whether it stabilizes or worsens.
Load-bearing premise
The prompts and style judgments used in the evaluation isolate adherence to speaking style from other model limitations such as general instruction following or output quality.
What would settle it
Run a test in which models receive a style instruction at turn one, then generate responses for five or more turns without any reminder, and measure whether human or automatic judges detect the original style in the later outputs at rates above chance.
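The decisive test above can be sketched as a small evaluation harness. This is a hypothetical sketch, not the paper's released code: `slm_respond` and `judge_style` are stand-in stubs for the SLM call and the human or automatic style judge, and the instruction text is illustrative.

```python
import random

# Hypothetical style instruction; the paper's actual prompts may differ.
STYLE_INSTRUCTION = (
    "Please respond in a whispering, low-volume voice for this whole conversation."
)

def slm_respond(history):
    """Stub for a spoken language model. A real harness would call the
    SLM's API with the full conversation history and return audio."""
    return f"response to turn {len(history)}"

def judge_style(response):
    """Stub style judge. A real harness would use human raters or an
    audio-aware LLM judge and return True if the style is detected."""
    return random.random() < 0.5  # placeholder judgment

def run_trial(n_turns=5, seed=0):
    """Turn 1 carries the style instruction; later turns add no reminder.
    Returns per-turn style-detection flags for the model's responses."""
    random.seed(seed)
    history = [("user", STYLE_INSTRUCTION)]
    detections = []
    for _ in range(n_turns):
        reply = slm_respond(history)
        history.append(("assistant", reply))
        detections.append(judge_style(reply))
        history.append(("user", "next neutral question"))  # no style reminder
    return detections

detections = run_trial()
rate = sum(detections) / len(detections)
# Compare `rate` in later turns against a chance baseline to decide
# whether the instructed style survives without reminders.
```

With real model and judge calls substituted in, comparing the detection rate of later turns against a chance baseline (or against a turn-1 baseline) would settle whether the original style persists.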
Original abstract
In this paper, we show that when spoken language models (SLMs) are instructed to speak in a specific speaking style at the beginning of a multi-turn conversation, they cannot maintain the required speaking styles after several turns of interaction; we refer to this as the style amnesia of SLMs. We focus on paralinguistic speaking styles, including emotion, accent, volume, and speaking speed. We evaluate three proprietary and two open-source SLMs, demonstrating that none of these models can maintain a consistent speaking style when instructed to do so. We further show that while SLMs can recall the style instruction when prompted in later turns, they still fail to express it, but through explicit recall can mitigate style amnesia. In addition, SLMs struggle more when the style instruction is placed in system messages rather than user messages, even though system messages are specifically designed to provide persistent, conversation-level instructions. Our findings highlight a systematic gap in current SLMs' ability to maintain speaking styles, highlighting the need for improved style adherence in future models. Our code and evaluation data are publicly available at https://github.com/YuXiangLin1234/SLM-Style-Amnesia.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that spoken language models (SLMs) suffer from 'style amnesia,' failing to maintain instructed paralinguistic speaking styles (emotion, accent, volume, speaking speed) over multi-turn conversations despite initial instructions. Evaluations across three proprietary and two open-source models show none maintain consistency; models recall instructions but fail to express them, with explicit recall as mitigation. Performance is worse when instructions are in system messages than user messages. Code and data are publicly released.
Significance. If the results hold, this identifies a systematic limitation in current SLMs for preserving paralinguistic style in interactive settings, relevant for natural spoken dialogue systems. The multi-model empirical evaluation with public code and data provides a reproducible benchmark that can guide improvements in style adherence.
major comments (2)
- [§4] §4 (Evaluation): The style adherence scores (e.g., Table 2) demonstrate degradation, but without ablations comparing style-instructed vs. neutral-instruction conditions or controls for overall coherence decline, it remains unclear whether the measurements isolate style-specific failure from general multi-turn instruction following or generation quality limitations.
- [§5.2] §5.2 (Mitigation): The explicit recall experiments show improved style expression, but the paper should report whether recall prompts affect secondary metrics such as response relevance or turn coherence to confirm the mitigation is style-targeted rather than a general prompt engineering effect.
minor comments (2)
- [Figures] Figure 1 and 2: Add explicit turn-number annotations and error bars to clarify degradation trends across models.
- [Abstract] Abstract: Specify the exact automatic metrics and human evaluation protocol used for style judgment to allow readers to assess reliability upfront.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have made revisions to strengthen the evaluation and mitigation analyses as suggested.
Point-by-point responses
- Referee: [§4] §4 (Evaluation): The style adherence scores (e.g., Table 2) demonstrate degradation, but without ablations comparing style-instructed vs. neutral-instruction conditions or controls for overall coherence decline, it remains unclear whether the measurements isolate style-specific failure from general multi-turn instruction following or generation quality limitations.
  Authors: We agree that the current evaluation would benefit from explicit controls to isolate style-specific degradation. In the revised manuscript, we will add an ablation comparing style-instructed conversations to neutral-instruction baselines, reporting style adherence scores under both conditions. We will also include secondary metrics for overall coherence and response quality across turns to control for general multi-turn decline, ensuring the style amnesia effect is distinguished from broader instruction-following limitations. revision: yes
- Referee: [§5.2] §5.2 (Mitigation): The explicit recall experiments show improved style expression, but the paper should report whether recall prompts affect secondary metrics such as response relevance or turn coherence to confirm the mitigation is style-targeted rather than a general prompt engineering effect.
  Authors: We concur that reporting secondary metrics is necessary to confirm the mitigation is style-specific. In the revised version, we will include measurements of response relevance and turn coherence for the explicit recall condition alongside the style adherence improvements. This will demonstrate that the gains in style expression do not result from general prompt engineering effects that might degrade other aspects of conversation quality. revision: yes
Circularity Check
No circularity: purely empirical evaluation of model behavior
Full rationale
The paper conducts an empirical study by prompting SLMs with style instructions and measuring degradation via automatic and human judgments across turns. No equations, derivations, fitted parameters, or self-citations are used to derive the central claim; results follow directly from observed outputs of the evaluated models. The work evaluates model responses directly, with public code and data, and thus qualifies as an independent empirical finding whose conclusions do not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: SLMs can interpret and attempt to follow paralinguistic style instructions given in prompts
invented entities (1)
- style amnesia (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "none of these models can maintain a consistent speaking style when instructed to do so"
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  "style amnesia... IF rate degrades rapidly in subsequent turns"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4):335--359
work page 2008
-
[2]
Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. 2014. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE transactions on affective computing, 5(4):377--390
work page 2014
-
[3]
Kai-Wei Chang, En-Pei Hu, Chun-Yi Kuan, Wenze Ren, Wei-Chih Chen, Guan-Ting Lin, Yu Tsao, Shao-Hua Sun, Hung-yi Lee, and James Glass. 2025. Game-time: Evaluating temporal dynamics in spoken language models. arXiv preprint arXiv:2509.26388
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Xize Cheng, Ruofan Hu, Xiaoda Yang, Jingyu Lu, Dongjie Fu, Zehan Wang, Shengpeng Ji, Rongjie Huang, Boyang Zhang, Tao Jin, and 1 others. 2025. Voxdialogue: Can spoken dialogue systems understand information beyond words? In The Thirteenth International Conference on Learning Representations
work page 2025
-
[5]
Cheng-Han Chiang and Hung-yi Lee. 2023. https://doi.org/10.18653/v1/2023.acl-long.870 Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607--15631, Toronto, Canada. Association for Computational Linguistics
-
[6]
Cheng-Han Chiang, Xiaofei Wang, Chung-Ching Lin, Kevin Lin, Linjie Li, Radu Kopetz, Yao Qian, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, and Lijuan Wang. 2025. https://doi.org/10.18653/v1/2025.findings-emnlp.25 Audio-aware large language models as judges for speaking styles . In Findings of the Association for Computational Linguistics: EMNLP 2025, pages...
-
[7]
Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37--46
work page 1960
-
[8]
Gheorghe Comanici and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. 2024. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. 2025. Llama-omni: Seamless speech interaction with large language models. In The Thirteenth International Conference on Learning Representations
work page 2025
-
[12]
Tiantian Feng, Kevin Huang, Anfeng Xu, Xuan Shi, Thanathai Lertpetchpun, Jihwan Lee, Yoonjeong Lee, Dani Byrd, and Shrikanth Narayanan. 2025a. Voxlect: A speech foundation model benchmark for modeling dialects and regional languages around the globe. arXiv preprint arXiv:2508.01691
-
[13]
Tiantian Feng, Jihwan Lee, Anfeng Xu, Yoonjeong Lee, Thanathai Lertpetchpun, Xuan Shi, Helin Wang, Thomas Thebaud, Laureano Moro-Velazquez, Dani Byrd, and 1 others. 2025b. Vox-profile: A speech foundation model benchmark for characterizing diverse speaker and speech traits. arXiv preprint arXiv:2505.14648
-
[14]
Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5):378
work page 1971
-
[15]
Google. 2025a. Advanced audio dialog and generation with Gemini 2.5. https://blog.google/technology/google-deepmind/gemini-2-5-native-audio/
work page 2025
-
[16]
Google. 2025b. Gemini Live: A more helpful, natural and visual assistant. https://blog.google/products/gemini/gemini-live-updates-august-2025/
work page 2025
-
[17]
Chi Han, Xin Liu, Haodong Wang, Shiyang Li, Jingfeng Yang, Haoming Jiang, Zhengyang Wang, Qingyu Yin, Liang Qiu, Changlong Yu, Yifan Gao, Zheng Li, Bing Yin, Jingbo Shang, and Heng Ji. 2025. https://doi.org/10.18653/v1/2025.findings-emnlp.1387 Can language models follow multiple turns of entangled instructions? In Findings of the Association for Computati...
-
[19]
Heeseung Kim, Che Hyun Lee, Sangkwon Park, Jiheum Yeom, Nohil Park, Sangwon Yu, and Sungroh Yoon. 2025. https://doi.org/10.18653/v1/2025.findings-acl.470 Does your voice assistant remember? analyzing conversational context recall and utilization in voice interaction models . In Findings of the Association for Computational Linguistics: ACL 2025, pages 898...
-
[20]
Hyunwoo Kim, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae Yu, Pei Zhou, Ronan Bras, Malihe Alikhani, Gunhee Kim, and 1 others. 2023. Soda: Million-scale dialogue distillation with social commonsense contextualization. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12930--12949
work page 2023
-
[21]
Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. In Advances in neural information processing systems, volume 33, pages 17022--17033
work page 2020
-
[22]
Klaus Krippendorff. 1980. Content analysis: An introduction to its methodology. Sage publications
work page 1980
-
[23]
Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1124 MT-Eval: A multi-turn capabilities evaluation benchmark for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages...
-
[24]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles
work page 2023
-
[25]
Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. 2025. Llms get lost in multi-turn conversation. arXiv preprint arXiv:2505.06120
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
Jinnan Li, Jinzhe Li, Yue Wang, Yi Chang, and Yuan Wu. 2025. https://doi.org/10.18653/v1/2025.findings-acl.486 StructFlowBench: A structured flow benchmark for multi-turn instruction following. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9322--9341, Vienna, Austria. Association for Computational Linguistics
-
[29]
Chengqian Ma, Wei Tao, and Steven Y Guo. 2025. C3: A bilingual benchmark for spoken dialogue models exploring challenges in complex conversations. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 22789--22807
work page 2025
-
[30]
Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen. 2024. emotion2vec: Self-supervised pre-training for speech emotion representation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 15747--15760
work page 2024
-
[31]
Brian W Matthews. 1975. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 405(2):442--451
work page 1975
-
[32]
Anna Neumann, Elisabeth Kirsten, Muhammad Bilal Zafar, and Jatinder Singh. 2025. https://doi.org/10.1145/3715275.3732038 Position is power: System prompts as a mechanism of bias in large language models (llms) . In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, FAccT '25, page 573–598, New York, NY, USA. Association ...
-
[33]
Nvidia. 2025. NVIDIA Speech AI Models Deliver Industry-Leading Accuracy and Performance. https://developer.nvidia.com/blog/nvidia-speech-ai-models-deliver-industry-leading-accuracy-and-performance/
work page 2025
-
[34]
OpenAI. 2024. GPT-4o mini: Advancing Cost-Efficient Intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
work page 2024
-
[35]
OpenAI. 2024. https://arxiv.org/abs/2410.21276 Gpt-4o system card . Preprint, arXiv:2410.21276
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[36]
OpenAI. 2025. Introducing GPT-5. https://openai.com/index/introducing-gpt-5/
work page 2025
-
[37]
Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2019. Meld: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 527--536
work page 2019
-
[38]
Ayesha Qamar, Jonathan Tong, and Ruihong Huang. 2025. https://doi.org/10.18653/v1/2025.acl-long.1271 Do LLMs understand dialogues? a case study on dialogue acts. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26219--26237, Vienna, Austria. Association for Computational Linguistics
-
[39]
Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. 2022. Utmos: Utokyo-sarulab system for voicemos challenge 2022. In Proc. Interspeech 2022, pages 4521--4525
work page 2022
-
[40]
Shuzheng Si, Wentao Ma, Haoyu Gao, Yuchuan Wu, Ting-En Lin, Yinpei Dai, Hangyu Li, Rui Yan, Fei Huang, and Yongbin Li. 2023. Spokenwoz: A large-scale speech-text benchmark for spoken task-oriented dialogue agents. Advances in Neural Information Processing Systems, 36:39088--39118
work page 2023
-
[41]
Christian J. Steinmetz and Joshua D. Reiss. 2021. pyloudnorm: A simple yet flexible loudness meter in python. In 150th AES Convention
work page 2021
-
[42]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[43]
Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. 2024. The instruction hierarchy: Training llms to prioritize privileged instructions. arXiv preprint arXiv:2404.13208
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[44]
Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, and 1 others. 2025. Step-audio 2 technical report. arXiv preprint arXiv:2507.16632
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
Jin Xu and 1 others. 2025. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Ruiqi Yan, Xiquan Li, Wenxi Chen, Zhikang Niu, Chen Yang, Ziyang Ma, Kai Yu, and Xie Chen. 2025. https://doi.org/10.18653/v1/2025.findings-emnlp.933 URO -bench: Towards comprehensive evaluation for end-to-end spoken dialogue models . In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 17211--17242, Suzhou, China. Association fo...
-
[47]
Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. 2024. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612
work page internal anchor Pith review Pith/arXiv arXiv 2024
discussion (0)