Pith Number

pith:NBCWDCLW

pith:2023:NBCWDCLWXUDO6HQY7HIHSP6I3Q
not attested · not anchored · not stored · refs resolved

AudioPaLM: A Large Language Model That Can Speak and Listen

Alexandru Tudor, Ankur Bapna, Christian Frank, Chulayuth Asawaroengchai, Dalia El Badawy, Damien Vincent, Danny Rozenberg, Dirk Padfield, Duc Dung Nguyen, Eugene Kharitonov, Félix de Chaumont Quitry, Hannah Muckenhirn, James Qin, Jiahui Yu, Johan Schalkwyk, Lukas Zilka, Marco Tagliasacchi, Matt Sharifi, Michelle Tadmor Ramanovich, Mihajlo Velimirović, Neil Zeghidour, Paul K. Rubenstein, Peter Chen, Tara Sainath, Vicky Zayats, Wei Han, Yongqiang Wang, Yu Zhang, Zalán Borsos, Zhishuai Zhang

Fusing a text language model with a speech model and initializing from text weights produces a system that processes and generates both modalities while outperforming prior speech translation systems.

arxiv:2306.12925 v1 · 2023-06-22 · cs.CL · cs.AI · cs.SD · eess.AS · stat.ML

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{NBCWDCLWXUDO6HQY7HIHSP6I3Q}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training.

C2 · weakest assumption

That initializing the multimodal model with text-only LLM weights successfully transfers linguistic knowledge to speech tasks without degrading paralinguistic capabilities inherited from the speech model.

C3 · one line summary

AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text translation for many unseen language pairs and voice transfer from short prompts.

References

41 extracted · 41 resolved · 10 Pith anchors


Formal links

2 machine-checked theorem links

Cited by

34 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:48.734771Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

6845618976bd06ef1e18f9d0793fc8dc2ca360ecfea224b1d8ccd8828451765a
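The identifiers on this page appear to be related to the canonical hash in a simple way: Base32-encoding the first 16 bytes of the SHA-256 digest (RFC 4648 alphabet, padding removed) reproduces the 26-character Pith Number shown above. This is an observation from this one record, not documented Pith behavior, and the function name `pith_number_from_hash` is hypothetical. A minimal sketch under that assumption:

```python
import base64


def pith_number_from_hash(canonical_hash_hex: str) -> str:
    """Assumed derivation: Base32 of the first 16 digest bytes, padding stripped."""
    digest = bytes.fromhex(canonical_hash_hex)
    # 16 bytes = 128 bits, which Base32 encodes as 26 characters plus padding.
    return base64.b32encode(digest[:16]).decode("ascii").rstrip("=")


canonical_hash = "6845618976bd06ef1e18f9d0793fc8dc2ca360ecfea224b1d8ccd8828451765a"
print(pith_number_from_hash(canonical_hash))  # NBCWDCLWXUDO6HQY7HIHSP6I3Q
```

If the relationship holds in general, a client could sanity-check that an identifier and a hash belong together without any network call.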

Aliases

arxiv: 2306.12925 · arxiv_version: 2306.12925v1 · doi: 10.48550/arxiv.2306.12925 · pith_short_12: NBCWDCLWXUDO · pith_short_16: NBCWDCLWXUDO6HQY · pith_short_8: NBCWDCLW
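The three `pith_short_*` aliases listed here are each a leading prefix of the full Pith Number. The sketch below assumes that prefix rule holds generally, which this page shows but does not state; `short_alias` is a hypothetical helper name.

```python
PITH_NUMBER = "NBCWDCLWXUDO6HQY7HIHSP6I3Q"


def short_alias(pith_number: str, length: int) -> str:
    """Return the leading `length` characters (assumed alias rule)."""
    if not 0 < length <= len(pith_number):
        raise ValueError("length out of range")
    return pith_number[:length]


# Rebuild the alias fields shown on this page from the full identifier.
aliases = {f"pith_short_{n}": short_alias(PITH_NUMBER, n) for n in (8, 12, 16)}
for name, value in aliases.items():
    print(name, value)
```

Shorter prefixes are easier to quote but more likely to collide, so a resolver would presumably need the full identifier for unambiguous lookup.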
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/NBCWDCLWXUDO6HQY7HIHSP6I3Q \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 6845618976bd06ef1e18f9d0793fc8dc2ca360ecfea224b1d8ccd8828451765a
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "54756cc188edb8b0ceb69b9d8ed9d18a2ea26823618d0b8390a681d3f74a04fc",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.SD",
      "eess.AS",
      "stat.ML"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-06-22T14:37:54Z",
    "title_canon_sha256": "4c83b0ade1e5e06c892f58a7dffd3da0129e3c17dc7e8fc1f861af10c1f83811"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2306.12925",
    "kind": "arxiv",
    "version": 1
  }
}
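The verification command above serializes the canonical record with sorted keys, compact separators, and non-ASCII characters left unescaped before hashing it. The same steps can be sketched offline in Python against the record printed on this page. The function name `canonical_sha256` is hypothetical, and whether the result matches the published hash depends on the served record being identical to the one shown here, which has not been checked.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the serialization from the verification command."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Canonical record as printed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "54756cc188edb8b0ceb69b9d8ed9d18a2ea26823618d0b8390a681d3f74a04fc",
        "cross_cats_sorted": ["cs.AI", "cs.SD", "eess.AS", "stat.ML"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2023-06-22T14:37:54Z",
        "title_canon_sha256": "4c83b0ade1e5e06c892f58a7dffd3da0129e3c17dc7e8fc1f861af10c1f83811",
    },
    "schema_version": "1.0",
    "source": {"id": "2306.12925", "kind": "arxiv", "version": 1},
}

print(canonical_sha256(record))
```

Because keys are sorted before hashing, two mirrors that store the same fields in different orders compute the same digest; list order, such as the entries in `cross_cats_sorted`, still matters.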