Pith Number

pith:MXBMD6Y2

pith:2026:MXBMD6Y22XYGBJPHRFNFY7KVTH
not attested · not anchored · not stored · refs resolved

Qwen3-TTS Technical Report

Baosong Yang, Bin Zhang, Dake Guo, Hangrui Hu, Hongkun Hao, Jingren Zhou, Jin Xu, Junyang Lin, Pei Zhang, Ting He, Xinfa Zhu, Xinyu Zhang, Xiong Wang, Zhifang Guo, Zishan Guo, Ziyue Jiang

Qwen3-TTS achieves state-of-the-art multilingual text-to-speech with 3-second voice cloning and low-latency streaming.

arxiv:2601.15621 v1 · 2026-01-22 · cs.SD · cs.CL · eess.AS

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{MXBMD6Y22XYGBJPHRFNFY7KVTH}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
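
The page does not describe the merge algorithm itself. As a rough illustration of what "deterministic" means here, the sketch below folds a set of events into a state that does not depend on the order in which a mirror received them. The event fields (id, at, field, value) and the function name merge_events are hypothetical and are not taken from the Pith bundle format.

```python
from typing import Iterable


def merge_events(events: Iterable[dict]) -> dict:
    """Fold events into one state, independent of arrival order.

    Hypothetical event shape: {"id", "at", "field", "value"}.
    Assumes two events with the same id are identical copies.
    """
    # Deduplicate by id so receiving the same event twice is harmless.
    unique = {event["id"]: event for event in events}
    # Sort by (timestamp, id) to get a total order every mirror agrees on.
    ordered = sorted(unique.values(), key=lambda e: (e["at"], e["id"]))
    state: dict = {}
    for event in ordered:
        # Later events overwrite earlier ones for the same field.
        state[event["field"]] = event["value"]
    return state


events = [
    {"id": "e1", "at": "2026-01-01T00:00:00Z", "field": "author_claim", "value": "open"},
    {"id": "e2", "at": "2026-02-01T00:00:00Z", "field": "author_claim", "value": "claimed"},
]
# Reversed and duplicated input yields the same state as the original order.
same = merge_events(events) == merge_events(list(reversed(events)) + events)
```

Sorting on a key that every mirror can compute from the events alone is what makes the result reproducible. A real bundle would also need signature checks, which this sketch omits.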

Claims

C1 strongest claim

Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set).

C2 weakest assumption

That the chosen benchmarks and subjective tests accurately reflect real-world multilingual use cases and that the 5 million hours of training data contain no systematic quality or bias issues that would degrade performance outside the reported evaluations.

C3 one line summary

Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.

References

26 extracted · 26 resolved · 8 Pith anchors

[1] Seed-TTS: A Family of High-Quality Versatile Speech Generation Models · arXiv:2406.02430
[2] F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching · arXiv:2410.06885
[3] High Fidelity Neural Audio Compression · arXiv:2210.13438
[4] Moshi: a speech-text foundation model for real-time dialogue · arXiv:2410.00037
[5] CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens · arXiv:2407.05407

Formal links

1 machine-checked theorem link

Cited by

21 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:46.851157Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

65c2c1fb1ad5f060a5e7895a5c7d5599cb7cd4c8c452f2ddd5afe4cc33bc702a
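
On this page, the Pith Number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. That is an observation from the two values shown here, not a documented rule of the pith-number/v1.0 schema. The helper name base32_prefix and the default length of 26 are assumptions made for this sketch.

```python
import base64

CANONICAL_HASH = "65c2c1fb1ad5f060a5e7895a5c7d5599cb7cd4c8c452f2ddd5afe4cc33bc702a"
PITH_NUMBER = "MXBMD6Y22XYGBJPHRFNFY7KVTH"


def base32_prefix(hex_digest: str, length: int = 26) -> str:
    """Base32-encode a hex digest and keep the first `length` characters."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


# Compare the derived prefix with the identifier shown on this page.
matches = base32_prefix(CANONICAL_HASH) == PITH_NUMBER
print(matches)
```

If the relationship holds in general, a reader could check that an identifier and a hash belong together without contacting the server. Treat that as a convenience check only, since the signature is what the record relies on.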

Aliases

arxiv: 2601.15621 · arxiv_version: 2601.15621v1 · doi: 10.48550/arxiv.2601.15621 · pith_short_12: MXBMD6Y22XYG · pith_short_16: MXBMD6Y22XYGBJPH · pith_short_8: MXBMD6Y2
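
The three pith_short_* aliases listed above are prefixes of the full Pith Number, at 8, 12, and 16 characters. A minimal sketch of that relationship, assuming plain prefix truncation is all that is involved (the function name short_aliases is hypothetical):

```python
PITH_NUMBER = "MXBMD6Y22XYGBJPHRFNFY7KVTH"


def short_aliases(pith_number: str, lengths=(8, 12, 16)) -> dict:
    """Build the short aliases by truncating the full identifier."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}


aliases = short_aliases(PITH_NUMBER)
for name, value in aliases.items():
    print(f"{name}: {value}")
```

Shorter prefixes are easier to quote but more likely to collide across records, which is presumably why several lengths are offered.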
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/MXBMD6Y22XYGBJPHRFNFY7KVTH \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 65c2c1fb1ad5f060a5e7895a5c7d5599cb7cd4c8c452f2ddd5afe4cc33bc702a
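
The hashing step of the one-liner above can also be written as a small Python module, which makes the canonicalization rules easier to read: sorted keys, no whitespace between tokens, and UTF-8 output without ASCII escaping. The function names canonical_bytes and canonical_sha256 are chosen for this sketch, and the two sample dictionaries are illustrative fragments, not the full record.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and compact separators, as in the one-liner."""
    text = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


# The same data with different key order produces identical canonical bytes.
a = {"source": {"kind": "arxiv", "id": "2601.15621"}, "schema_version": "1.0"}
b = {"schema_version": "1.0", "source": {"id": "2601.15621", "kind": "arxiv"}}
print(canonical_bytes(a).decode("utf-8"))
print(canonical_sha256(a) == canonical_sha256(b))
```

To verify the real record, pass the parsed canonical_record object from the API response to canonical_sha256 and compare the result with the canonical hash listed on this page.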
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "43e7a156c135547a462f7739d8e3537a2f0d147f2c0f63d766e748add5dabc39",
    "cross_cats_sorted": [
      "cs.CL",
      "eess.AS"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.SD",
    "submitted_at": "2026-01-22T03:51:43Z",
    "title_canon_sha256": "d99480f1d84569a33dfe616970481a3a7b0dd54aba79c8dbaf4086a6d7aa9619"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2601.15621",
    "kind": "arxiv",
    "version": 1
  }
}