pith. machine review for the scientific record.

arxiv: 2412.02612 · v1 · submitted 2024-12-03 · 💻 cs.CL · cs.SD · eess.AS

Recognition: 2 theorem links

GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 03:48 UTC · model grok-4.3

classification 💻 cs.CL · cs.SD · eess.AS
keywords end-to-end spoken chatbot · speech tokenizer · voice language model · bilingual voice conversation · speech language modeling · spoken question answering · multimodal pre-training · voice synthesis

The pith

GLM-4-Voice turns a text language model into an end-to-end spoken chatbot that reaches state-of-the-art results in speech language modeling and spoken question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

GLM-4-Voice converts a text-based language model into a spoken chatbot that handles real-time bilingual conversations while adjusting emotion, intonation, speech rate, and dialect on command. It introduces an ultra-low-bitrate single-codebook tokenizer derived from an ASR encoder to compress speech into tokens that mix directly with text. The model continues pre-training from GLM-4-9B on a trillion tokens of unsupervised speech, synthesized interleaved speech-text data, and supervised pairs, then fine-tunes on high-quality conversational speech. This produces superior conversational ability and speech quality compared with existing baselines.

Core claim

By adding a vector-quantized bottleneck to an ASR encoder, the system creates a 175-bps, 12.5-Hz single-codebook tokenizer that lets the model continue pre-training from a text-only checkpoint. Scaling to one trillion tokens across mixed speech and text data yields state-of-the-art performance in speech language modeling and spoken question answering; subsequent fine-tuning on conversational speech data further improves dialogue quality and naturalness of the generated voice.
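
As an editorial back-of-the-envelope check (these numbers are implied by the stated rates, not quoted from the paper), the bitrate and frame rate fix the per-token bit budget and the codebook size a single codebook would need:

$$
\frac{175\ \mathrm{bits/s}}{12.5\ \mathrm{tokens/s}} = 14\ \mathrm{bits/token}
\;\Rightarrow\; 2^{14} = 16384\ \text{codebook entries},
\qquad 12.5 \times 3600 = 45{,}000\ \text{tokens per hour of speech}.
$$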

What carries the argument

An ultra-low-bitrate (175 bps), single-codebook speech tokenizer at a 12.5 Hz frame rate, obtained by inserting vector quantization into an ASR encoder, which enables direct transfer of knowledge from text pre-training into the speech modality.
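
To make the load-bearing mechanism concrete, here is a minimal sketch of a VQ-VAE-style bottleneck inserted after an encoder, assuming a 16384-entry codebook (the size implied by 14 bits/token) and a 512-dimensional feature space; the paper's actual architecture, dimensions, and training losses are not given above, so everything here is illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class VQBottleneck(nn.Module):
    """Single-codebook vector quantizer (VQ-VAE style) inserted after an encoder.

    Illustrative sketch only: codebook_size=16384 matches the 14 bits/token
    implied by 175 bps at 12.5 Hz, but the paper's actual design may differ.
    """

    def __init__(self, codebook_size: int = 16384, dim: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, h: torch.Tensor):
        # h: (batch, frames, dim) continuous features at the 12.5 Hz frame rate.
        flat = h.reshape(-1, h.size(-1))                  # (B*T, dim)
        dist = torch.cdist(flat, self.codebook.weight)    # (B*T, codebook_size)
        ids = dist.argmin(dim=-1).reshape(h.shape[:-1])   # discrete speech tokens
        q = self.codebook(ids)                            # quantized features
        # Straight-through estimator: gradients reach the encoder as if the
        # nearest-neighbour lookup were the identity.
        q = h + (q - h).detach()
        return q, ids


# Toy usage: ~10 s of (fake) encoder output at 12.5 Hz.
feats = torch.randn(1, 125, 512)
quantized, speech_tokens = VQBottleneck()(feats)
print(speech_tokens.shape)  # torch.Size([1, 125]) -> 125 tokens for 10 s
```

The straight-through estimator is the standard way to push gradients through the non-differentiable nearest-neighbour lookup; whether the paper's tokenizer is trained this way, and with what auxiliary losses, is not stated in the material above.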

Load-bearing premise

The synthesized speech-text interleaved data and the ultra-low-bitrate tokenizer preserve sufficient information for nuanced vocal control and accurate spoken question answering without introducing systematic artifacts or information loss.

What would settle it

A head-to-head test on spoken questions that hinge on fine vocal distinctions, such as specific emotions or near-homophone words, against a cascaded ASR-plus-LLM-plus-TTS baseline: lower accuracy or less intelligible output from the end-to-end model would break the premise; parity or better would support it.
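
A hedged sketch of how that comparison could be run; every handle below (asr, llm, tts, end_to_end, judge, items) is a hypothetical placeholder, not an API from the paper or its released code:

```python
# All model handles (asr, llm, tts, end_to_end, judge) are hypothetical
# placeholders, not real APIs from the paper or its released models.

def evaluate(system, items, judge):
    """items: list of (audio_question, reference_answer); returns accuracy."""
    correct = 0
    for audio, reference in items:
        answer_audio = system.respond(audio)          # spoken answer
        correct += int(judge(answer_audio, reference))
    return correct / len(items)


def cascaded_baseline(asr, llm, tts):
    """ASR -> LLM -> TTS cascade: vocal nuance is discarded at the ASR step."""
    class Cascade:
        def respond(self, audio):
            text = llm.generate(asr.transcribe(audio))
            return tts.synthesize(text)
    return Cascade()

# The premise fails if, on the nuance-sensitive subset,
#   evaluate(end_to_end, items, judge) < evaluate(cascaded_baseline(asr, llm, tts), items, judge)
# with comparable intelligibility; the reverse outcome supports it.
```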

read the original abstract

We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low bitrate (175bps), single-codebook speech tokenizer with 12.5Hz frame rate derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. To efficiently transfer knowledge from text to speech modalities, we synthesize speech-text interleaved data from existing text pre-training corpora using a text-to-token model. We continue pre-training from the pre-trained text language model GLM-4-9B with a combination of unsupervised speech data, interleaved speech-text data, and supervised speech-text data, scaling up to 1 trillion tokens, achieving state-of-the-art performance in both speech language modeling and spoken question answering. We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality. The open models can be accessed through https://github.com/THUDM/GLM-4-Voice and https://huggingface.co/THUDM/glm-4-voice-9b.
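
The interleaving step is the part of the recipe that benefits most from being pictured. A minimal sketch, assuming fixed-size spans and a per-span coin flip between modalities (the actual policy in the paper and in reference [46] may differ), of how a text corpus could become speech-text interleaved training sequences:

```python
import random


def interleave(text_tokens, text_to_speech_tokens, span=32, p_speech=0.5):
    """Split a text-token sequence into fixed-size spans and replace a random
    subset with discrete speech tokens from a text-to-token model, so one
    training sequence carries the same content in both modalities."""
    out = []
    for i in range(0, len(text_tokens), span):
        chunk = text_tokens[i:i + span]
        if random.random() < p_speech:
            out.extend(text_to_speech_tokens(chunk))   # speech-token span
        else:
            out.extend(chunk)                          # text-token span
    return out


# Toy usage with a dummy text-to-token model standing in for the real one.
doc = list(range(200))                                 # stand-in text token ids
fake_t2t = lambda toks: [f"<s{t}>" for t in toks]      # hypothetical synthesizer
print(interleave(doc, fake_t2t)[:40])
```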

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces GLM-4-Voice, an end-to-end spoken chatbot supporting Chinese and English that uses a 175bps single-codebook speech tokenizer (12.5 Hz frame rate) derived from an ASR model via a vector-quantized bottleneck. It synthesizes speech-text interleaved data from text corpora via a text-to-token model, continues pre-training from GLM-4-9B on up to 1T tokens (unsupervised speech + interleaved + supervised), claims SOTA in speech language modeling and spoken QA, and after fine-tuning on high-quality conversational speech data reports superior conversational ability and speech quality versus baselines. Open models are released.

Significance. If the empirical claims hold, this would be a meaningful advance in efficient end-to-end spoken dialogue by showing that ultra-low-bitrate tokenization combined with synthesized interleaved data can enable effective modality transfer at trillion-token scale while supporting instruction-controlled prosody and emotion. The public release of the models is a clear strength that supports reproducibility and follow-on work.

major comments (3)
  1. [§4 (Experiments)] The abstract and results assert state-of-the-art performance in speech language modeling and spoken question answering after 1T-token pre-training, yet no quantitative metrics, baseline comparisons, ablation studies, or error analyses are supplied, leaving the central empirical claims unverifiable.
  2. [§3.1 (Speech Tokenizer)] The 175bps VQ tokenizer is presented as preserving sufficient phonetic, prosodic, and paralinguistic information for nuanced vocal control and accurate spoken QA, but no reconstruction metrics (e.g., emotion classification accuracy or prosody correlation on reconstructed speech) are reported to support this assumption (a sketch of such a check follows the minor comments below).
  3. [§3.2 (Data Synthesis)] The synthesized speech-text interleaved data is central to the modality-transfer pipeline, but no ablations isolating its contribution (versus scale or the final fine-tuning set) are provided, so it is impossible to determine whether downstream gains arise from the proposed method.
minor comments (2)
  1. [§3.1] The notation for tokenizer bitrate, frame rate, and codebook size should be defined explicitly on first use with a short equation or table for clarity.
  2. [§4] Figure captions and evaluation protocol descriptions could be expanded to specify exact metrics and test sets used for the spoken QA and conversational quality comparisons.
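
Picking up major comment 2: a minimal sketch of the requested reconstruction check, which round-trips speech through the tokenizer and measures emotion-label retention and pitch-contour correlation. Every callable here (tokenize, detokenize, emotion_clf, extract_f0) is a hypothetical placeholder, not part of the paper's released code.

```python
import numpy as np


def reconstruction_fidelity(utterances, tokenize, detokenize, emotion_clf, extract_f0):
    """Round-trip each utterance through the 175 bps tokenizer and report
    (a) how often the predicted emotion label survives and (b) the mean
    correlation between original and reconstructed F0 contours."""
    kept_emotion, f0_corrs = [], []
    for wav in utterances:
        recon = detokenize(tokenize(wav))                     # tokenizer round trip
        kept_emotion.append(emotion_clf(recon) == emotion_clf(wav))
        f0_a, f0_b = extract_f0(wav), extract_f0(recon)
        n = min(len(f0_a), len(f0_b))                         # align contour lengths
        f0_corrs.append(np.corrcoef(f0_a[:n], f0_b[:n])[0, 1])
    return float(np.mean(kept_emotion)), float(np.nanmean(f0_corrs))
```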

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to provide the requested empirical details.

read point-by-point responses
  1. Referee: [§4 (Experiments)] The abstract and results assert state-of-the-art performance in speech language modeling and spoken question answering after 1T-token pre-training, yet no quantitative metrics, baseline comparisons, ablation studies, or error analyses are supplied, leaving the central empirical claims unverifiable.

    Authors: We acknowledge that the current manuscript version does not include sufficient quantitative metrics, baseline comparisons, ablation studies, or error analyses in §4 to fully verify the SOTA claims. In the revised manuscript, we will expand the experiments section with specific metrics (e.g., perplexity for speech language modeling, accuracy on spoken QA tasks), direct comparisons to relevant baselines, ablations on pre-training components, and error analysis to substantiate the claims. revision: yes

  2. Referee: [§3.1 (Speech Tokenizer)] The 175bps VQ tokenizer is presented as preserving sufficient phonetic, prosodic, and paralinguistic information for nuanced vocal control and accurate spoken QA, but no reconstruction metrics (e.g., emotion classification accuracy or prosody correlation on reconstructed speech) are reported to support this assumption.

    Authors: The tokenizer's utility is supported indirectly by the end-to-end system results, but we agree that direct reconstruction metrics would provide stronger evidence. We will add these in the revised §3.1, including emotion classification accuracy and prosody correlation metrics on reconstructed speech. revision: yes

  3. Referee: [§3.2 (Data Synthesis)] The synthesized speech-text interleaved data is central to the modality-transfer pipeline, but no ablations isolating its contribution (versus scale or the final fine-tuning set) are provided, so it is impossible to determine whether downstream gains arise from the proposed method.

    Authors: We agree that ablations are needed to isolate the interleaved data's contribution. In the revision, we will include controlled ablations comparing models trained with and without the synthesized interleaved data (holding scale and fine-tuning data fixed) to demonstrate its specific impact. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pre-training and benchmark evaluation chain is self-contained

full rationale

The paper describes a standard pipeline: derive a 175bps VQ tokenizer from an ASR encoder, synthesize interleaved data via a text-to-token model, continue pre-training GLM-4-9B on 1T tokens of mixed speech/text data, then fine-tune on conversational speech. All performance claims (SOTA speech LM and spoken QA) are obtained by direct comparison to external baselines after training. No equation, parameter, or result is defined in terms of itself or a fitted quantity that is then re-presented as a prediction. The base GLM-4-9B reference is ordinary transfer learning and does not carry the central claims. No self-citation load-bearing step, uniqueness theorem, or ansatz smuggling appears in the derivation.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 1 invented entity

The central claim depends on standard transformer pre-training assumptions plus two key design choices whose effectiveness is asserted rather than independently derived.

free parameters (3)
  • tokenizer bitrate = 175 bps
    Ultra-low bitrate of 175 bps chosen to enable efficient modeling
  • frame rate = 12.5 Hz
    12.5 Hz frame rate selected for the single-codebook tokenizer
  • pre-training data volume = 1 trillion tokens
    Scale of continued pre-training set at 1 trillion tokens
axioms (2)
  • domain assumption GLM-4-9B text language model provides a suitable base for speech extension
    Continued pre-training begins from this checkpoint
  • domain assumption Synthesized interleaved speech-text data transfers knowledge effectively from text to speech modalities
    Used to bridge the two modalities during pre-training
invented entities (1)
  • Vector-quantized bottleneck inserted into ASR encoder · no independent evidence
    purpose: Produces discrete low-bitrate speech tokens
    Core component of the new tokenizer

pith-pipeline@v0.9.0 · 5563 in / 1648 out tokens · 78200 ms · 2026-05-16T03:48:50.845924+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

    cs.SD 2026-05 unverdicted novelty 7.0

    AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

  2. How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

    cs.CL 2026-05 unverdicted novelty 7.0

    Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...

  3. VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

    cs.CL 2026-05 unverdicted novelty 7.0

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

  4. Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

    cs.CL 2026-05 unverdicted novelty 7.0

    TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.

  5. SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.

  6. Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection

    cs.CR 2026-04 unverdicted novelty 7.0

    AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.

  7. Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.

  8. HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models

    eess.AS 2026-04 unverdicted novelty 7.0

    HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...

  9. TiCo: Time-Controllable Spoken Dialogue Model

    cs.CL 2026-03 unverdicted novelty 7.0

    TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

  10. The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

    eess.AS 2026-03 unverdicted novelty 7.0

    FLAIR enables spoken dialogue AI to conduct continuous latent reasoning while perceiving speech through recursive latent embeddings and an ELBO-based finetuning objective.

  11. Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints

    cs.SD 2026-04 unverdicted novelty 6.0

    Rule-generated preference data aligned via sequential DPO and KTO reduces musical constraint violations and improves coherence in lyric-to-melody generation over baselines.

  12. GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking

    cs.SD 2026-04 unverdicted novelty 6.0

    GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on fo...

  13. Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs

    eess.AS 2026-04 unverdicted novelty 6.0

    A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.

  14. FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection

    cs.SD 2026-04 unverdicted novelty 6.0

    FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.

  15. Step-Audio 2 Technical Report

    cs.CL 2025-07 unverdicted novelty 6.0

    Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and c...

  16. Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

    cs.CL 2026-05 unverdicted novelty 5.0

    TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.

  17. Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

    cs.SD 2026-05 unverdicted novelty 5.0

    A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with c...

  18. Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge

    eess.AS 2026-04 unverdicted novelty 5.0

    A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.

  19. Kimi-Audio Technical Report

    eess.AS 2025-04 unverdicted novelty 5.0

    Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...

  20. Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    cs.CL 2025-03 unverdicted novelty 5.0

    Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 19 Pith papers · 8 internal anchors

  1. [1]

    Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms

    Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, Zhangyu Xiao, Zhijie Yan, Yexin Yang, Bin Zhang, Qinglin Zhang, Shiliang Zhang, Nan Z...

  2. [2]

    URL https://doi.org/10.48550/arXiv.2407.04051

  3. [3]

    Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing

    Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, et al. Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5723–5738, 2022

  4. [4]

    Common voice: A massively-multilingual speech corpus

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020 , pages 4218–4222. Europea...

  5. [5]

    Semantic parsing on freebase from question-answer pairs

    Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL , pages 1533–...

  6. [6]

    Audiolm: A language modeling approach to audio generation

    Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matthew Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. Audiolm: A language modeling approach to audio generation. IEEE ACM Trans. Audio Speech Lang. Process., 31:2523–2533, 2023

  7. [7]

    AISHELL-1: an open-source mandarin speech corpus and a speech recognition baseline

    Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. AISHELL-1: an open-source mandarin speech corpus and a speech recognition baseline. In 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, O-COCOSDA 2017, Seoul, South Korea, November 1-3, 2017 , pages 1–5. IEEE, 2017

  8. [8]

    Gigaspeech: An evolving, multi-domain ASR corpus with 10, 000 hours of transcribed audio

    Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, and Zhiyong Yan. Gigaspeech: An evolving, multi-domain ASR corpus with 10, 000 hours of transcribed audio. ...

  9. [9]

    Speechnet: A universal modularized model for speech processing tasks

    Yi-Chen Chen, Po-Han Chi, Shu-wen Yang, Kai-Wei Chang, Jheng-hao Lin, Sung-Feng Huang, Da-Rong Liu, Chi-Liang Liu, Cheng-Kuang Lee, and Hung-yi Lee. Speechnet: A universal modularized model for speech processing tasks. arXiv preprint arXiv:2105.03070, 2021

  10. [10]

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. CoRR, abs/2311.07919, 2023

  11. [11]

    w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training

    Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021, Cartagena, Colombia, December 13-17, 2021 , pages 244–250. IEEE, 2021

  12. [12]

    High fidelity neural audio compression

    Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. Trans. Mach. Learn. Res., 2023, 2023

  13. [13]

    Moshi: a speech-text foundation model for real-time dialogue

    Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue. Technical report, Kyutai, September 2024. URL http://kyutai.org/Moshi.pdf

  14. [14]

    Jukebox: A Generative Model for Music

    Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. CoRR, abs/2005.00341, 2020

  15. [15]

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, and Zhijie Yan. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens, 2024. URL https://arxiv.org/abs/2407.05407

  16. [16]

    Llama-omni: Seamless speech interaction with large language models

    Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. Llama-omni: Seamless speech interaction with large language models, 2024. URL https://arxiv.org/abs/2409.06666

  17. [17]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shu...

  18. [18]

    Textually pretrained speech language models

    Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Défossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz, and Yossi Adi. Textually pretrained speech language models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023,...

  19. [19]

    Visqol: an objective speech quality model

    Andrew Hines, Jan Skoglund, Anil Kokaram, and Naomi Harte. Visqol: an objective speech quality model. EURASIP Journal on Audio, Speech, and Music Processing , 2015 (13):1–18, 2015

  20. [20]

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units

    Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE ACM Trans. Audio Speech Lang. Process. , 29:3451–3460, 2021

  21. [21]

    Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling

    Shengpeng Ji, Ziyue Jiang, Xize Cheng, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Wen Wang, and Zhou Zhao. Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling. CoRR, abs/2408.16532, 2024

  22. [22]

    Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1601–1611. Association for C...

  23. [23]

    Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis

    Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17022–17033. Curran Associates, Inc., 2020. URL https://proceedings.neur...

  24. [24]

    High-fidelity audio compression with improved RVQGAN

    Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. High-fidelity audio compression with improved RVQGAN. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023

  25. [25]

    On generative spoken language modeling from raw audio

    Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, and Emmanuel Dupoux. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021

  26. [26]

    Alpacaeval: An automatic evaluator of instruction-following models

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023

  27. [27]

    Mosnet: Deep learning-based objective assessment for voice conversion

    Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang. Mosnet: Deep learning-based objective assessment for voice conversion. In Gernot Kubin and Zdravko Kacic, editors, 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, September 15-19, 2019 , pages 15...

  28. [28]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101

  29. [29]

    Matcha-TTS: A fast TTS architecture with conditional flow matching

    Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter. Matcha-TTS: A fast TTS architecture with conditional flow matching. In Proc. ICASSP, 2024

  30. [30]

    A corpus and evaluation framework for deeper understanding of commonsense stories

    Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James F. Allen. A corpus and evaluation framework for deeper understanding of commonsense stories. CoRR, abs/1604.01696, 2016

  31. [31]

    Spoken question answering and speech continuation using spectrogram-powered LLM

    Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, R. J. Skerry-Ryan, and Michelle Tadmor Ramanovich. Spoken question answering and speech continuation using spectrogram-powered LLM. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11...

  32. [32]

    Expresso: A benchmark and analysis of discrete expressive speech resynthesis

    Tu Anh Nguyen, Wei-Ning Hsu, Antony D’Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarandi, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, and Emmanuel Dupoux. Expresso: A benchmark and analysis of discrete expressive speech resynthesis. In Naomi Harte, Julie Carson-Berndsen, and Gareth Jones, editors, 24th Annual Conferenc...

  33. [33]

    Spirit LM: Interleaved Spoken and Written Language Model

    Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoit Sagot, and Emmanuel Dupoux. Spirit-lm: Interleaved spoken and written language model, 2024. URL https://arxiv.org/abs/2402.05755

  34. [34]

    Hello gpt-4o, 2024

    OpenAI. Hello gpt-4o, 2024. URL https://openai.com/index/hello-gpt-4o/

  35. [35]

    Librispeech: An asr corpus based on public domain audio books

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015. doi: 10.1109/ICASSP.2015.7178964

  36. [36]

    MLS: A large-scale multilingual dataset for speech research

    Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. MLS: A large-scale multilingual dataset for speech research. In 21st Annual Conference of the International Speech Communication Association, Interspeech 2020, Virtual Event, Shanghai, China, October 25-29, 2020 , pages 2757–2761. ISCA, 2020

  37. [37]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawai...

  38. [38]

    Utmos: Utokyo-sarulab system for voicemos challenge 2022

    Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. Utmos: Utokyo-sarulab system for voicemos challenge 2022. Interspeech 2022, 2022

  39. [39]

    Seaco-paraformer: A non-autoregressive asr system with flexible and effective hotword customization ability

    Xian Shi, Yexin Yang, Zerui Li, and Shiliang Zhang. Seaco-paraformer: A non-autoregressive asr system with flexible and effective hotword customization ability. arXiv preprint arXiv:2308.03266 (accepted by ICASSP2024) , 2023

  40. [40]

    Wavenet: A generative model for raw audio

    Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. In The 9th ISCA Speech Synthesis Workshop, SSW 2016, Sunnyvale, CA, USA, September 13-15, 2016 , page 125. ISCA, 2016

  41. [41]

    Neural discrete representation learning

    Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Decemb...

  42. [42]

    Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm

    Xiong Wang, Yangze Li, Chaoyou Fu, Lei Xie, Ke Li, Xing Sun, and Long Ma. Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm, 2024. URL https://arxiv.org/abs/2411.00774

  43. [43]

    Mini-omni: Language models can hear, talk while thinking in streaming

    Zhifei Xie and Changqiao Wu. Mini-omni: Language models can hear, talk while thinking in streaming, 2024. URL https://arxiv.org/abs/2408.16725

  44. [44]

    Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit

    Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, and Xin Lei. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. In 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30 - September 3, 2021 , p...

  45. [45]

    Soundstream: An end-to-end neural audio codec

    Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. Soundstream: An end-to-end neural audio codec. IEEE ACM Trans. Audio Speech Lang. Process., 30:495–507, 2022. doi: 10.1109/TASLP.2021.3129994. URL https://doi.org/10.1109/TASLP.2021.3129994

  46. [46]

    Scaling speech-text pre-training with synthetic interleaved data, 2024

    Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, and Jie Tang. Scaling speech-text pre-training with synthetic interleaved data, 2024. URL https://arxiv.org/abs/2411.17607

  47. [47]

    Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities

    Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities, 2023. URL https://arxiv.org/abs/2305.11000

  48. [48]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: open pre-trained transformer language models. CoRR, abs/2205.01068, 2022

  49. [49]

    Speechtokenizer: Unified speech tokenizer for speech language models

    Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu. Speechtokenizer: Unified speech tokenizer for speech language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024

  50. [50]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https://arxiv.org/abs/2306.05685