pith. machine review for the scientific record.

arxiv: 2412.02612 · v1 · submitted 2024-12-03 · 💻 cs.CL · cs.SD · eess.AS

Recognition: 2 theorem links

GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 03:48 UTC · model grok-4.3

classification 💻 cs.CL · cs.SD · eess.AS
keywords end-to-end spoken chatbot · speech tokenizer · voice language model · bilingual voice conversation · speech language modeling · spoken question answering · multimodal pre-training · voice synthesis

The pith

GLM-4-Voice turns a text language model into an end-to-end spoken chatbot that reaches state-of-the-art results in speech language modeling and spoken question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

GLM-4-Voice converts a text-based language model into a spoken chatbot that handles real-time bilingual conversations while adjusting emotion, intonation, speech rate, and dialect on command. It introduces an ultra-low-bitrate single-codebook tokenizer derived from an ASR encoder to compress speech into tokens that mix directly with text. The model continues pre-training from GLM-4-9B on a trillion tokens of unsupervised speech, synthesized interleaved speech-text data, and supervised pairs, then fine-tunes on high-quality conversational speech. This produces superior conversational ability and speech quality compared with existing baselines.

Core claim

By adding a vector-quantized bottleneck to an ASR encoder, the system creates a 175-bps, 12.5-Hz single-codebook tokenizer that lets the model continue pre-training from a text-only checkpoint. Scaling to one trillion tokens across mixed speech and text data yields state-of-the-art performance in speech language modeling and spoken question answering; subsequent fine-tuning on conversational speech data further improves dialogue quality and naturalness of the generated voice.
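
As an editorial back-of-the-envelope check (these numbers are implied by the stated rates, not quoted from the paper), the bitrate and frame rate fix the per-token bit budget and the codebook size a single codebook would need:

$$
\frac{175\ \mathrm{bits/s}}{12.5\ \mathrm{tokens/s}} = 14\ \mathrm{bits/token}
\;\Rightarrow\; 2^{14} = 16384\ \text{codebook entries},
\qquad 12.5 \times 3600 = 45{,}000\ \text{tokens per hour of speech}.
$$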

What carries the argument

An ultra-low-bitrate (175 bps), single-codebook speech tokenizer at a 12.5 Hz frame rate, obtained by inserting vector quantization into an ASR encoder, which enables direct transfer of knowledge from text pre-training into the speech modality.
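
To make the load-bearing mechanism concrete, here is a minimal sketch of a VQ-VAE-style bottleneck inserted after an encoder, assuming a 16384-entry codebook (the size implied by 14 bits/token) and a 512-dimensional feature space; the paper's actual architecture, dimensions, and training losses are not given above, so everything here is illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class VQBottleneck(nn.Module):
    """Single-codebook vector quantizer (VQ-VAE style) inserted after an encoder.

    Illustrative sketch only: codebook_size=16384 matches the 14 bits/token
    implied by 175 bps at 12.5 Hz, but the paper's actual design may differ.
    """

    def __init__(self, codebook_size: int = 16384, dim: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, h: torch.Tensor):
        # h: (batch, frames, dim) continuous features at the 12.5 Hz frame rate.
        flat = h.reshape(-1, h.size(-1))                  # (B*T, dim)
        dist = torch.cdist(flat, self.codebook.weight)    # (B*T, codebook_size)
        ids = dist.argmin(dim=-1).reshape(h.shape[:-1])   # discrete speech tokens
        q = self.codebook(ids)                            # quantized features
        # Straight-through estimator: gradients reach the encoder as if the
        # nearest-neighbour lookup were the identity.
        q = h + (q - h).detach()
        return q, ids


# Toy usage: ~10 s of (fake) encoder output at 12.5 Hz.
feats = torch.randn(1, 125, 512)
quantized, speech_tokens = VQBottleneck()(feats)
print(speech_tokens.shape)  # torch.Size([1, 125]) -> 125 tokens for 10 s
```

The straight-through estimator is the standard way to push gradients through the non-differentiable nearest-neighbour lookup; whether the paper's tokenizer is trained this way, and with what auxiliary losses, is not stated in the material above.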

Load-bearing premise

The synthesized speech-text interleaved data and the ultra-low-bitrate tokenizer preserve sufficient information for nuanced vocal control and accurate spoken question answering without introducing systematic artifacts or information loss.

What would settle it

A head-to-head test on spoken questions that hinge on fine vocal distinctions, such as specific emotions or near-homophone words, against a cascaded ASR-plus-LLM-plus-TTS baseline: lower accuracy or less intelligible output from the end-to-end model would break the premise; parity or better would support it.
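
A hedged sketch of how that comparison could be run; every handle below (asr, llm, tts, end_to_end, judge, items) is a hypothetical placeholder, not an API from the paper or its released code:

```python
# All model handles (asr, llm, tts, end_to_end, judge) are hypothetical
# placeholders, not real APIs from the paper or its released models.

def evaluate(system, items, judge):
    """items: list of (audio_question, reference_answer); returns accuracy."""
    correct = 0
    for audio, reference in items:
        answer_audio = system.respond(audio)          # spoken answer
        correct += int(judge(answer_audio, reference))
    return correct / len(items)


def cascaded_baseline(asr, llm, tts):
    """ASR -> LLM -> TTS cascade: vocal nuance is discarded at the ASR step."""
    class Cascade:
        def respond(self, audio):
            text = llm.generate(asr.transcribe(audio))
            return tts.synthesize(text)
    return Cascade()

# The premise fails if, on the nuance-sensitive subset,
#   evaluate(end_to_end, items, judge) < evaluate(cascaded_baseline(asr, llm, tts), items, judge)
# with comparable intelligibility; the reverse outcome supports it.
```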

read the original abstract

We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low bitrate (175bps), single-codebook speech tokenizer with 12.5Hz frame rate derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. To efficiently transfer knowledge from text to speech modalities, we synthesize speech-text interleaved data from existing text pre-training corpora using a text-to-token model. We continue pre-training from the pre-trained text language model GLM-4-9B with a combination of unsupervised speech data, interleaved speech-text data, and supervised speech-text data, scaling up to 1 trillion tokens, achieving state-of-the-art performance in both speech language modeling and spoken question answering. We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality. The open models can be accessed through https://github.com/THUDM/GLM-4-Voice and https://huggingface.co/THUDM/glm-4-voice-9b.
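
The interleaving step is the part of the recipe that benefits most from being pictured. A minimal sketch, assuming fixed-size spans and a per-span coin flip between modalities (the actual policy in the paper and in reference [46] may differ), of how a text corpus could become speech-text interleaved training sequences:

```python
import random


def interleave(text_tokens, text_to_speech_tokens, span=32, p_speech=0.5):
    """Split a text-token sequence into fixed-size spans and replace a random
    subset with discrete speech tokens from a text-to-token model, so one
    training sequence carries the same content in both modalities."""
    out = []
    for i in range(0, len(text_tokens), span):
        chunk = text_tokens[i:i + span]
        if random.random() < p_speech:
            out.extend(text_to_speech_tokens(chunk))   # speech-token span
        else:
            out.extend(chunk)                          # text-token span
    return out


# Toy usage with a dummy text-to-token model standing in for the real one.
doc = list(range(200))                                 # stand-in text token ids
fake_t2t = lambda toks: [f"<s{t}>" for t in toks]      # hypothetical synthesizer
print(interleave(doc, fake_t2t)[:40])
```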

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces GLM-4-Voice, an end-to-end spoken chatbot supporting Chinese and English that uses a 175bps single-codebook speech tokenizer (12.5 Hz frame rate) derived from an ASR model via a vector-quantized bottleneck. It synthesizes speech-text interleaved data from text corpora via a text-to-token model, continues pre-training from GLM-4-9B on up to 1T tokens (unsupervised speech + interleaved + supervised), claims SOTA in speech language modeling and spoken QA, and after fine-tuning on high-quality conversational speech data reports superior conversational ability and speech quality versus baselines. Open models are released.

Significance. If the empirical claims hold, this would be a meaningful advance in efficient end-to-end spoken dialogue by showing that ultra-low-bitrate tokenization combined with synthesized interleaved data can enable effective modality transfer at trillion-token scale while supporting instruction-controlled prosody and emotion. The public release of the models is a clear strength that supports reproducibility and follow-on work.

major comments (3)
  1. [§4 (Experiments)] The abstract and results assert state-of-the-art performance in speech language modeling and spoken question answering after 1T-token pre-training, yet no quantitative metrics, baseline comparisons, ablation studies, or error analyses are supplied, leaving the central empirical claims unverifiable.
  2. [§3.1 (Speech Tokenizer)] The 175bps VQ tokenizer is presented as preserving sufficient phonetic, prosodic, and paralinguistic information for nuanced vocal control and accurate spoken QA, but no reconstruction metrics (e.g., emotion classification accuracy or prosody correlation on reconstructed speech) are reported to support this assumption (a sketch of such a check follows the minor comments below).
  3. [§3.2 (Data Synthesis)] The synthesized speech-text interleaved data is central to the modality-transfer pipeline, but no ablations isolating its contribution (versus scale or the final fine-tuning set) are provided, so it is impossible to determine whether downstream gains arise from the proposed method.
minor comments (2)
  1. [§3.1] The notation for tokenizer bitrate, frame rate, and codebook size should be defined explicitly on first use with a short equation or table for clarity.
  2. [§4] Figure captions and evaluation protocol descriptions could be expanded to specify exact metrics and test sets used for the spoken QA and conversational quality comparisons.
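
Picking up major comment 2: a minimal sketch of the requested reconstruction check, which round-trips speech through the tokenizer and measures emotion-label retention and pitch-contour correlation. Every callable here (tokenize, detokenize, emotion_clf, extract_f0) is a hypothetical placeholder, not part of the paper's released code.

```python
import numpy as np


def reconstruction_fidelity(utterances, tokenize, detokenize, emotion_clf, extract_f0):
    """Round-trip each utterance through the 175 bps tokenizer and report
    (a) how often the predicted emotion label survives and (b) the mean
    correlation between original and reconstructed F0 contours."""
    kept_emotion, f0_corrs = [], []
    for wav in utterances:
        recon = detokenize(tokenize(wav))                     # tokenizer round trip
        kept_emotion.append(emotion_clf(recon) == emotion_clf(wav))
        f0_a, f0_b = extract_f0(wav), extract_f0(recon)
        n = min(len(f0_a), len(f0_b))                         # align contour lengths
        f0_corrs.append(np.corrcoef(f0_a[:n], f0_b[:n])[0, 1])
    return float(np.mean(kept_emotion)), float(np.nanmean(f0_corrs))
```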

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to provide the requested empirical details.

read point-by-point responses
  1. Referee: [§4 (Experiments)] The abstract and results assert state-of-the-art performance in speech language modeling and spoken question answering after 1T-token pre-training, yet no quantitative metrics, baseline comparisons, ablation studies, or error analyses are supplied, leaving the central empirical claims unverifiable.

    Authors: We acknowledge that the current manuscript version does not include sufficient quantitative metrics, baseline comparisons, ablation studies, or error analyses in §4 to fully verify the SOTA claims. In the revised manuscript, we will expand the experiments section with specific metrics (e.g., perplexity for speech language modeling, accuracy on spoken QA tasks), direct comparisons to relevant baselines, ablations on pre-training components, and error analysis to substantiate the claims. revision: yes

  2. Referee: [§3.1 (Speech Tokenizer)] The 175bps VQ tokenizer is presented as preserving sufficient phonetic, prosodic, and paralinguistic information for nuanced vocal control and accurate spoken QA, but no reconstruction metrics (e.g., emotion classification accuracy or prosody correlation on reconstructed speech) are reported to support this assumption.

    Authors: The tokenizer's utility is supported indirectly by the end-to-end system results, but we agree that direct reconstruction metrics would provide stronger evidence. We will add these in the revised §3.1, including emotion classification accuracy and prosody correlation metrics on reconstructed speech. revision: yes

  3. Referee: [§3.2 (Data Synthesis)] The synthesized speech-text interleaved data is central to the modality-transfer pipeline, but no ablations isolating its contribution (versus scale or the final fine-tuning set) are provided, so it is impossible to determine whether downstream gains arise from the proposed method.

    Authors: We agree that ablations are needed to isolate the interleaved data's contribution. In the revision, we will include controlled ablations comparing models trained with and without the synthesized interleaved data (holding scale and fine-tuning data fixed) to demonstrate its specific impact. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pre-training and benchmark evaluation chain is self-contained

full rationale

The paper describes a standard pipeline: derive a 175bps VQ tokenizer from an ASR encoder, synthesize interleaved data via a text-to-token model, continue pre-training GLM-4-9B on 1T tokens of mixed speech/text data, then fine-tune on conversational speech. All performance claims (SOTA speech LM and spoken QA) are obtained by direct comparison to external baselines after training. No equation, parameter, or result is defined in terms of itself or a fitted quantity that is then re-presented as a prediction. The base GLM-4-9B reference is ordinary transfer learning and does not carry the central claims. No self-citation load-bearing step, uniqueness theorem, or ansatz smuggling appears in the derivation.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 1 invented entity

The central claim depends on standard transformer pre-training assumptions plus two key design choices whose effectiveness is asserted rather than independently derived.

free parameters (3)
  • tokenizer bitrate = 175 bps
    Ultra-low bitrate of 175 bps chosen to enable efficient modeling
  • frame rate = 12.5 Hz
    12.5 Hz frame rate selected for the single-codebook tokenizer
  • pre-training data volume = 1 trillion tokens
    Scale of continued pre-training set at 1 trillion tokens
axioms (2)
  • domain assumption GLM-4-9B text language model provides a suitable base for speech extension
    Continued pre-training begins from this checkpoint
  • domain assumption Synthesized interleaved speech-text data transfers knowledge effectively from text to speech modalities
    Used to bridge the two modalities during pre-training
invented entities (1)
  • Vector-quantized bottleneck inserted into ASR encoder · no independent evidence
    purpose: Produces discrete low-bitrate speech tokens
    Core component of the new tokenizer

pith-pipeline@v0.9.0 · 5563 in / 1648 out tokens · 78200 ms · 2026-05-16T03:48:50.845924+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

    cs.SD 2026-05 unverdicted novelty 7.0

    AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

  2. How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

    cs.CL 2026-05 unverdicted novelty 7.0

    Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...

  3. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  4. Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

    cs.CL 2026-05 unverdicted novelty 7.0

    TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.

  5. SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.

  6. Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

    cs.CR 2026-04 unverdicted novelty 7.0

    AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.

  7. Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.

  8. HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models

    eess.AS 2026-04 unverdicted novelty 7.0

    HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...

  9. TiCo: Time-Controllable Spoken Dialogue Model

    cs.CL 2026-03 unverdicted novelty 7.0

    TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

  10. The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

    eess.AS 2026-03 unverdicted novelty 7.0

    FLAIR enables spoken dialogue AI to conduct continuous latent reasoning while perceiving speech through recursive latent embeddings and an ELBO-based finetuning objective.

  11. Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints

    cs.SD 2026-04 unverdicted novelty 6.0

    Rule-generated preference data aligned via sequential DPO and KTO reduces musical constraint violations and improves coherence in lyric-to-melody generation over baselines.

  12. GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking

    cs.SD 2026-04 unverdicted novelty 6.0

    GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on fo...

  13. Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs

    eess.AS 2026-04 unverdicted novelty 6.0

    A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.

  14. FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection

    cs.SD 2026-04 unverdicted novelty 6.0

    FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.

  15. Step-Audio 2 Technical Report

    cs.CL 2025-07 unverdicted novelty 6.0

    Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...

  16. Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

    cs.CL 2026-05 unverdicted novelty 5.0

    TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.

  17. Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

    cs.SD 2026-05 unverdicted novelty 5.0

    A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with c...

  18. Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge

    eess.AS 2026-04 unverdicted novelty 5.0

    A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.

  19. Kimi-Audio Technical Report

    eess.AS 2025-04 unverdicted novelty 5.0

    Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...

  20. Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    cs.CL 2025-03 unverdicted novelty 5.0

    Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 19 Pith papers · 8 internal anchors

  1. [1]

    Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms

    Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, Zhangyu Xiao, Zhijie Yan, Yexin Yang, Bin Zhang, Qinglin Zhang, Shiliang Zhang, Nan Z...

  2. [2]

    URL https://doi.org/10.48550/arXiv.2407.04051

  3. [3]

    Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing

    Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, et al. Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5723–5738, 2022

  4. [4]

    Common voice: A massively-multilingual speech corpus

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020 , pages 4218–4222. Europea...

  5. [5]

    Semantic parsing on freebase from question-answer pairs

    Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL , pages 1533–...

  6. [6]

    Audiolm: A language modeling approach to audio generation

    Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matthew Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. Audiolm: A language modeling approach to audio generation. IEEE ACM Trans. Audio Speech Lang. Process., 31:2523–2533, 2023

  7. [7]

    AISHELL-1: an open-source mandarin speech corpus and a speech recognition baseline

    Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. AISHELL-1: an open-source mandarin speech corpus and a speech recognition baseline. In 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, O-COCOSDA 2017, Seoul, South Korea, November 1-3, 2017 , pages 1–5. IEEE, 2017

  8. [8]

    Gigaspeech: An evolving, multi-domain ASR corpus with 10, 000 hours of transcribed audio

    Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, and Zhiyong Yan. Gigaspeech: An evolving, multi-domain ASR corpus with 10, 000 hours of transcribed audio. ...

  9. [9]

    Speechnet: A universal modularized model for speech processing tasks

    Yi-Chen Chen, Po-Han Chi, Shu-wen Yang, Kai-Wei Chang, Jheng-hao Lin, Sung-Feng Huang, Da-Rong Liu, Chi-Liang Liu, Cheng-Kuang Lee, and Hung-yi Lee. Speechnet: A universal modularized model for speech processing tasks. arXiv preprint arXiv:2105.03070, 2021

  10. [10]

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. CoRR, abs/2311.07919, 2023

  11. [11]

    w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training

    Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021, Cartagena, Colombia, December 13-17, 2021 , pages 244–250. IEEE, 2021

  12. [12]

    High fidelity neural audio compression

    Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. Trans. Mach. Learn. Res., 2023, 2023

  13. [13]

    Moshi: a speech-text foundation model for real-time dialogue

    Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue. Technical report, Kyutai, September 2024. URL http://kyutai.org/Moshi.pdf

  14. [14]

    Jukebox: A Generative Model for Music

    Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. CoRR, abs/2005.00341, 2020

  15. [15]

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, and Zhijie Yan. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens, 2024. URL https://arxiv.org/abs/2407.05407

  16. [16]

    Llama-omni: Seamless speech interaction with large language models

    Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. Llama-omni: Seamless speech interaction with large language models, 2024. URL https://arxiv.org/abs/2409.06666

  17. [17]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shu...

  18. [18]

    Textually pretrained speech language models

    Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Défossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz, and Yossi Adi. Textually pretrained speech language models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023,...

  19. [19]

    Visqol: an objective speech quality model

    Andrew Hines, Jan Skoglund, Anil Kokaram, and Naomi Harte. Visqol: an objective speech quality model. EURASIP Journal on Audio, Speech, and Music Processing , 2015 (13):1–18, 2015

  20. [20]

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units

    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE ACM Trans. Audio Speech Lang. Process. , 29:3451–3460, 2021

  21. [21]

    Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling

    Shengpeng Ji, Ziyue Jiang, Xize Cheng, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Wen Wang, and Zhou Zhao. Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling. CoRR, abs/2408.16532, 2024

  22. [22]

    Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1601–1611. Association for C...

  23. [23]

    Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis

    Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17022–17033. Curran Associates, Inc., 2020. URL https://proceedings.neur...

  24. [24]

    High-fidelity audio compression with improved RVQGAN

    Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. High-fidelity audio compression with improved RVQGAN. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023

  25. [25]

    On generative spoken language modeling from raw audio

    Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, and Emmanuel Dupoux. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021

  26. [26]

    Alpacaeval: An automatic evaluator of instruction-following models

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023

  27. [27]

    Mosnet: Deep learning-based objective assessment for voice conversion

    Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang. Mosnet: Deep learning-based objective assessment for voice conversion. In Gernot Kubin and Zdravko Kacic, editors, 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, September 15-19, 2019 , pages 15...

  28. [28]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101

  29. [29]

    Matcha-TTS: A fast TTS architecture with conditional flow matching

    Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter. Matcha-TTS: A fast TTS architecture with conditional flow matching. In Proc. ICASSP, 2024

  30. [30]

    A corpus and evaluation framework for deeper understanding of commonsense stories

    Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James F. Allen. A corpus and evaluation framework for deeper understanding of commonsense stories. CoRR, abs/1604.01696, 2016

  31. [31]

    Spoken question answering and speech continuation using spectrogram-powered LLM

    Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, R. J. Skerry-Ryan, and Michelle Tadmor Ramanovich. Spoken question answering and speech continuation using spectrogram-powered LLM. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11...

  32. [32]

    Expresso: A benchmark and analysis of discrete expressive speech resynthesis

    Tu Anh Nguyen, Wei-Ning Hsu, Antony D’Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarandi, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, and Emmanuel Dupoux. Expresso: A benchmark and analysis of discrete expressive speech resynthesis. In Naomi Harte, Julie Carson-Berndsen, and Gareth Jones, editors, 24th Annual Conferenc...

  33. [33]

    Spirit LM: Interleaved Spoken and Written Language Model

    Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoit Sagot, and Emmanuel Dupoux. Spirit-lm: Interleaved spoken and written language model, 2024. URL https://arxiv.org/abs/2402.05755

  34. [34]

    Hello gpt-4o, 2024

    OpenAI. Hello gpt-4o, 2024. URL https://openai.com/index/hello-gpt-4o/

  35. [35]

    Librispeech: An asr corpus based on public domain audio books

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015. doi: 10.1109/ICASSP.2015.7178964

  36. [36]

    MLS: A large-scale multilingual dataset for speech research

    Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. MLS: A large-scale multilingual dataset for speech research. In 21st Annual Conference of the International Speech Communication Association, Interspeech 2020, Virtual Event, Shanghai, China, October 25-29, 2020 , pages 2757–2761. ISCA, 2020

  37. [37]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawai...

  38. [38]

    Utmos: Utokyo-sarulab system for voicemos challenge 2022

    Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. Utmos: Utokyo-sarulab system for voicemos challenge 2022. Interspeech 2022, 2022

  39. [39]

    Seaco-paraformer: A non-autoregressive asr system with flexible and effective hotword customization ability

    Xian Shi, Yexin Yang, Zerui Li, and Shiliang Zhang. Seaco-paraformer: A non-autoregressive asr system with flexible and effective hotword customization ability. arXiv preprint arXiv:2308.03266 (accepted by ICASSP2024) , 2023

  40. [40]

    Wavenet: A generative model for raw audio

    Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. In The 9th ISCA Speech Synthesis Workshop, SSW 2016, Sunnyvale, CA, USA, September 13-15, 2016 , page 125. ISCA, 2016

  41. [41]

    Neural discrete representation learning

    Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Decemb...

  42. [42]

    Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm

    Xiong Wang, Yangze Li, Chaoyou Fu, Lei Xie, Ke Li, Xing Sun, and Long Ma. Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm, 2024. URL https://arxiv.org/abs/2411.00774

  43. [43]

    Mini-omni: Language models can hear, talk while thinking in streaming

    Zhifei Xie and Changqiao Wu. Mini-omni: Language models can hear, talk while thinking in streaming, 2024. URL https://arxiv.org/abs/2408.16725

  44. [44]

    Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit

    Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, and Xin Lei. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. In 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30 - September 3, 2021 , p...

  45. [45]

    Soundstream: An end-to-end neural audio codec

    Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. Soundstream: An end-to-end neural audio codec. IEEE ACM Trans. Audio Speech Lang. Process., 30:495–507, 2022. doi: 10.1109/TASLP.2021.3129994. URL https://doi.org/10.1109/TASLP.2021.3129994

  46. [46]

    Scaling speech-text pre-training with synthetic interleaved data, 2024

    Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, and Jie Tang. Scaling speech-text pre-training with synthetic interleaved data, 2024. URL https://arxiv.org/abs/2411.17607

  47. [47]

    Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities

    Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities, 2023. URL https://arxiv.org/abs/2305.11000

  48. [48]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: open pre-trained transformer language models. CoRR, abs/2205.01068, 2022

  49. [49]

    Speechtokenizer: Unified speech tokenizer for speech language models

    Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu. Speechtokenizer: Unified speech tokenizer for speech language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024

  50. [50]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https://arxiv.org/abs/2306.05685