{"paper":{"title":"GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot","license":"http://creativecommons.org/licenses/by/4.0/","headline":"GLM-4-Voice turns a text language model into an end-to-end spoken chatbot that reaches state-of-the-art results in speech language modeling and spoken question answering.","cross_cats":["cs.SD","eess.AS"],"primary_cat":"cs.CL","authors_text":"Aohan Zeng, Jie Tang, Kedong Wang, Lei Zhao, Mingdao Liu, Shengmin Jiang, Yuxiao Dong, Zhengxiao Du","submitted_at":"2024-12-03T17:41:24Z","abstract_excerpt":"We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low bitrate (175bps), single-codebook speech tokenizer with 12.5Hz frame rate derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. To efficiently transfer knowledge from text to speech modalities, we synthesize speech-text interlea"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"achieving state-of-the-art performance in both speech language modeling and spoken question answering. 
We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The synthesized speech-text interleaved data and the ultra-low-bitrate tokenizer preserve sufficient information for nuanced vocal control and accurate spoken question answering without introducing systematic artifacts or information loss that would undermine the claimed gains.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"GLM-4-Voice builds an end-to-end spoken chatbot by deriving a 175bps single-codebook tokenizer from ASR, synthesizing interleaved speech-text data, and continuing pre-training of GLM-4-9B on up to 1 trillion tokens before fine-tuning on conversational speech.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"GLM-4-Voice turns a text language model into an end-to-end spoken chatbot that reaches state-of-the-art results in speech language modeling and spoken question answering.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"14d4dc893cf5ff95587304523b4c8dbe3ace662cd1ec4076bfdcd9519283f291"},"source":{"id":"2412.02612","kind":"arxiv","version":1},"verdict":{"id":"9e2aa5b6-d391-4f5e-ac77-281600d81335","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T03:48:50.845924Z","strongest_claim":"achieving state-of-the-art performance in both speech language modeling and spoken question answering. 
We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality.","one_line_summary":"GLM-4-Voice builds an end-to-end spoken chatbot by deriving a 175bps single-codebook tokenizer from ASR, synthesizing interleaved speech-text data, and continuing pre-training of GLM-4-9B on up to 1 trillion tokens before fine-tuning on conversational speech.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The synthesized speech-text interleaved data and the ultra-low-bitrate tokenizer preserve sufficient information for nuanced vocal control and accurate spoken question answering without introducing systematic artifacts or information loss that would undermine the claimed gains.","pith_extraction_headline":"GLM-4-Voice turns a text language model into an end-to-end spoken chatbot that reaches state-of-the-art results in speech language modeling and spoken question answering."},"references":{"count":50,"sample":[{"doi":"","year":null,"title":"Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms","work_id":"7cd6d289-dca2-414f-99e0-809f37c065fa","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.48550/arxiv.2407.04051","year":null,"title":"Funaudiollm: Voice understanding and generation foundation models for natural interaction between humans and llms","work_id":"7cd6d289-dca2-414f-99e0-809f37c065fa","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing","work_id":"d6d2d38d-a03a-44a0-acd7-84cdd7540abe","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Tyers, and Gregor 
Weber","work_id":"b2ecb06c-4d8f-461c-b4c1-e2df14d3b130","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2013,"title":"Semantic parsing on freebase from question-answer pairs","work_id":"279e4368-a78d-459d-81a5-fabb138741b9","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":50,"snapshot_sha256":"f8c5f310e103d0bc88b6e5d7391096a54a1b20f5381ac57434b4cd2cd038a600","internal_anchors":8},"formal_canon":{"evidence_count":2,"snapshot_sha256":"04db6433a248afd6506bba1ffeff15dcae198aa070a6cd9704519466a78e3adc"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}