Pith Number

pith:C2GQ777B

pith:2024:C2GQ777BR6S5ZGOGYLD3N4NGXE

not attested · not anchored · not stored · refs resolved

GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Aohan Zeng, Jie Tang, Kedong Wang, Lei Zhao, Mingdao Liu, Shengmin Jiang, Yuxiao Dong, Zhengxiao Du

GLM-4-Voice turns a text language model into an end-to-end spoken chatbot that reaches state-of-the-art results in speech language modeling and spoken question answering.

arxiv:2412.02612 v1 · 2024-12-03 · cs.CL · cs.SD · eess.AS

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{C2GQ777BR6S5ZGOGYLD3N4NGXE}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

achieving state-of-the-art performance in both speech language modeling and spoken question answering. We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality.

C2 weakest assumption

The synthesized speech-text interleaved data and the ultra-low-bitrate tokenizer preserve sufficient information for nuanced vocal control and accurate spoken question answering without introducing systematic artifacts or information loss that would undermine the claimed gains.

C3 one-line summary

GLM-4-Voice builds an end-to-end spoken chatbot by deriving a 175 bps single-codebook speech tokenizer from an ASR model, synthesizing interleaved speech-text data, and continuing pre-training of GLM-4-9B on up to 1 trillion tokens before fine-tuning on conversational speech.

References

50 extracted · 50 resolved · 8 Pith anchors

[1] FunAudioLLM: Voice understanding and generation foundation models for natural interaction between humans and LLMs

[2] FunAudioLLM: Voice understanding and generation foundation models for natural interaction between humans and LLMs · doi:10.48550/arxiv.2407.04051

[3] SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing · 2022

[4] Tyers, and Gregor Weber · 2020

[5] Semantic parsing on Freebase from question-answer pairs · 2013

Formal links

2 machine-checked theorem links

Cited by

32 papers in Pith

On The Landscape of Spoken Language Models: A Comprehensive Survey

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action

A Survey of Audio Reasoning in Multimodal Foundation Models

Receipt and verification

First computed	2026-05-17T23:38:49.177452Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

168d0fffe18fa5dc99c6c2c7b6f1a6b90cdc30d51a9d3ba7487fbfdfe6f1131f

Aliases

arxiv: 2412.02612 · arxiv_version: 2412.02612v1 · doi: 10.48550/arxiv.2412.02612 · pith_short_12: C2GQ777BR6S5 · pith_short_16: C2GQ777BR6S5ZGOG · pith_short_8: C2GQ777B
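
The identifiers on this page look mechanically related, and a short Python sketch makes that easy to check. Two things are assumed here rather than documented on the page: that the 26-character Pith Number is the unpadded RFC 4648 Base32 encoding of the first 16 bytes of the canonical hash, and that each `pith_short_N` alias is simply its first N characters. Both assumptions agree with the values listed for this record, but nothing here states them as rules. The function names are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "168d0fffe18fa5dc99c6c2c7b6f1a6b90cdc30d51a9d3ba7487fbfdfe6f1131f"
PITH_ID = "C2GQ777BR6S5ZGOGYLD3N4NGXE"


def pith_id_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 16 bytes of a SHA-256 hex digest.

    Assumed rule, inferred from this record only.
    """
    head = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(head).decode("ascii").rstrip("=")


def short_aliases(pith_id: str) -> dict[str, str]:
    """Return the 8-, 12- and 16-character prefixes listed under Aliases."""
    return {f"pith_short_{n}": pith_id[:n] for n in (8, 12, 16)}


print(pith_id_from_hash(CANONICAL_HASH))  # C2GQ777BR6S5ZGOGYLD3N4NGXE
print(short_aliases(PITH_ID))
```

If the assumption holds in general, any short alias can be checked against the canonical hash without contacting the resolver.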

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/C2GQ777BR6S5ZGOGYLD3N4NGXE \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 168d0fffe18fa5dc99c6c2c7b6f1a6b90cdc30d51a9d3ba7487fbfdfe6f1131f

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "06a88367091efb37490e4ba14e5336a532d84d97a04058097c3a8edf47df8ebd",
    "cross_cats_sorted": [
      "cs.SD",
      "eess.AS"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-12-03T17:41:24Z",
    "title_canon_sha256": "7541d60a6b93d682e37a502e975d122105084c3539b4b7ab4cd320904b751813"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2412.02612",
    "kind": "arxiv",
    "version": 1
  }
}
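
The same check as the shell one-liner can be run offline against the record printed above. The sketch below uses the canonicalization from that one-liner (sorted keys, compact separators, UTF-8, no ASCII escaping); the helper name `canonical_sha256` is hypothetical. It assumes the JSON shown on this page is the complete `canonical_record` returned by the resolver; if the two differ, the digest will not match the canonical hash.

```python
import hashlib
import json

# The canonical record exactly as displayed on this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "06a88367091efb37490e4ba14e5336a532d84d97a04058097c3a8edf47df8ebd",
    "cross_cats_sorted": ["cs.SD", "eess.AS"],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-12-03T17:41:24Z",
    "title_canon_sha256": "7541d60a6b93d682e37a502e975d122105084c3539b4b7ab4cd320904b751813"
  },
  "schema_version": "1.0",
  "source": {"id": "2412.02612", "kind": "arxiv", "version": 1}
}
"""

EXPECTED = "168d0fffe18fa5dc99c6c2c7b6f1a6b90cdc30d51a9d3ba7487fbfdfe6f1131f"


def canonical_sha256(record: dict) -> str:
    """Hash a record after serializing it the way the shell one-liner does."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches canonical hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the key order or whitespace of the input JSON, which is what lets a mirror recompute it from any faithful copy of the record.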