Pith Number

pith:C2GQ777B

pith:2024:C2GQ777BR6S5ZGOGYLD3N4NGXE
not attested · not anchored · not stored · refs resolved

GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Aohan Zeng, Jie Tang, Kedong Wang, Lei Zhao, Mingdao Liu, Shengmin Jiang, Yuxiao Dong, Zhengxiao Du

GLM-4-Voice turns a text language model into an end-to-end spoken chatbot that reaches state-of-the-art results in speech language modeling and spoken question answering.

arxiv:2412.02612 v1 · 2024-12-03 · cs.CL · cs.SD · eess.AS

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{C2GQ777BR6S5ZGOGYLD3N4NGXE}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

achieving state-of-the-art performance in both speech language modeling and spoken question answering. We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality.

C2 weakest assumption

The synthesized speech-text interleaved data and the ultra-low-bitrate tokenizer preserve sufficient information for nuanced vocal control and accurate spoken question answering without introducing systematic artifacts or information loss that would undermine the claimed gains.

C3 one line summary

GLM-4-Voice builds an end-to-end spoken chatbot by deriving a 175bps single-codebook tokenizer from ASR, synthesizing interleaved speech-text data, and continuing pre-training of GLM-4-9B on up to 1 trillion tokens before fine-tuning on conversational speech.
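
The 175 bps figure can be sanity-checked with a short calculation. This is a sketch under assumptions that are not stated on this page: a frame rate of 12.5 tokens per second and a single codebook of 16384 entries (14 bits per token). The function name `tokenizer_bitrate` is hypothetical and only illustrates the arithmetic.

```python
import math


def tokenizer_bitrate(frame_rate_hz: float, codebook_size: int) -> float:
    """Bits per second for a single-codebook tokenizer.

    Each token carries log2(codebook_size) bits, and frame_rate_hz
    tokens are emitted per second of audio.
    """
    return frame_rate_hz * math.log2(codebook_size)


# Assumed values, not taken from this page.
FRAME_RATE_HZ = 12.5     # tokens per second (assumption)
CODEBOOK_SIZE = 16384    # 2**14 entries, i.e. 14 bits per token (assumption)

bitrate = tokenizer_bitrate(FRAME_RATE_HZ, CODEBOOK_SIZE)
print(bitrate)  # 175.0
```

Under these assumptions the product is 12.5 × 14 = 175 bits per second, which matches the figure in the summary.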

References

50 extracted · 50 resolved · 8 Pith anchors

[1] FunAudioLLM: Voice understanding and generation foundation models for natural interaction between humans and LLMs
[2] FunAudioLLM: Voice understanding and generation foundation models for natural interaction between humans and LLMs · doi:10.48550/arxiv.2407.04051
[3] Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing 2022
[4] Tyers, and Gregor Weber 2020
[5] Semantic parsing on freebase from question-answer pairs 2013

Formal links

2 machine-checked theorem links

Cited by

32 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:49.177452Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

168d0fffe18fa5dc99c6c2c7b6f1a6b90cdc30d51a9d3ba7487fbfdfe6f1131f

Aliases

arxiv: 2412.02612 · arxiv_version: 2412.02612v1 · doi: 10.48550/arxiv.2412.02612 · pith_short_12: C2GQ777BR6S5 · pith_short_16: C2GQ777BR6S5ZGOG · pith_short_8: C2GQ777B
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/C2GQ777BR6S5ZGOGYLD3N4NGXE \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 168d0fffe18fa5dc99c6c2c7b6f1a6b90cdc30d51a9d3ba7487fbfdfe6f1131f
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "06a88367091efb37490e4ba14e5336a532d84d97a04058097c3a8edf47df8ebd",
    "cross_cats_sorted": [
      "cs.SD",
      "eess.AS"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-12-03T17:41:24Z",
    "title_canon_sha256": "7541d60a6b93d682e37a502e975d122105084c3539b4b7ab4cd320904b751813"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2412.02612",
    "kind": "arxiv",
    "version": 1
  }
}
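
The same check can be run offline against the record as printed above. The following is a minimal sketch, assuming the JSON shown on this page is the complete canonical record and that canonicalization is exactly what the shell pipeline does (sorted keys, compact separators, UTF-8 encoding). The helper name `canonical_sha256` is hypothetical. Whether the digest equals the canonical hash listed above has not been verified here; compare the printed value yourself.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash with SHA-256."""
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "06a88367091efb37490e4ba14e5336a532d84d97a04058097c3a8edf47df8ebd",
        "cross_cats_sorted": ["cs.SD", "eess.AS"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-12-03T17:41:24Z",
        "title_canon_sha256": "7541d60a6b93d682e37a502e975d122105084c3539b4b7ab4cd320904b751813",
    },
    "schema_version": "1.0",
    "source": {"id": "2412.02612", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare against the canonical hash listed on this page
```

Because keys are sorted before hashing, the order in which the fields are written does not affect the digest. Any difference in whitespace or string escaping between this sketch and the registry's builder would.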