{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:NBCWDCLWXUDO6HQY7HIHSP6I3Q","short_pith_number":"pith:NBCWDCLW","schema_version":"1.0","canonical_sha256":"6845618976bd06ef1e18f9d0793fc8dc2ca360ecfea224b1d8ccd8828451765a","source":{"kind":"arxiv","id":"2306.12925","version":1},"attestation_state":"computed","paper":{"title":"AudioPaLM: A Large Language Model That Can Speak and Listen","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Fusing a text language model with a speech model and initializing from text weights produces a system that processes and generates both modalities while outperforming prior speech translation systems.","cross_cats":["cs.AI","cs.SD","eess.AS","stat.ML"],"primary_cat":"cs.CL","authors_text":"Alexandru Tudor, Ankur Bapna, Christian Frank, Chulayuth Asawaroengchai, Dalia El Badawy, Damien Vincent, Danny Rozenberg, Dirk Padfield, Duc Dung Nguyen, Eugene Kharitonov, F\\'elix de Chaumont Quitry, Hannah Muckenhirn, James Qin, Jiahui Yu, Johan Schalkwyk, Lukas Zilka, Marco Tagliasacchi, Matt Sharifi, Michelle Tadmor Ramanovich, Mihajlo Velimirovi\\'c, Neil Zeghidour, Paul K. Rubenstein, Peter Chen, Tara Sainath, Vicky Zayats, Wei Han, Yongqiang Wang, Yu Zhang, Zal\\'an Borsos, Zhishuai Zhang","submitted_at":"2023-06-22T14:37:54Z","abstract_excerpt":"We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demons"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2306.12925","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2023-06-22T14:37:54Z","cross_cats_sorted":["cs.AI","cs.SD","eess.AS","stat.ML"],"title_canon_sha256":"4c83b0ade1e5e06c892f58a7dffd3da0129e3c17dc7e8fc1f861af10c1f83811","abstract_canon_sha256":"54756cc188edb8b0ceb69b9d8ed9d18a2ea26823618d0b8390a681d3f74a04fc"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:48.735357Z","signature_b64":"4guzjZW6eo/nci6E52jvpeiqypIobpoeeLR+VA/egy9WI9wnWRYTQSKCYah3VhJkTPcUVo+8MywUm/1KiovYCA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"6845618976bd06ef1e18f9d0793fc8dc2ca360ecfea224b1d8ccd8828451765a","last_reissued_at":"2026-05-17T23:38:48.734771Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:48.734771Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"AudioPaLM: A Large Language Model That Can Speak and Listen","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Fusing a text language model with a speech model and initializing from text weights produces a system that processes and generates both modalities while outperforming prior speech translation systems.","cross_cats":["cs.AI","cs.SD","eess.AS","stat.ML"],"primary_cat":"cs.CL","authors_text":"Alexandru Tudor, Ankur Bapna, Christian Frank, Chulayuth Asawaroengchai, Dalia El Badawy, Damien Vincent, Danny Rozenberg, Dirk Padfield, Duc Dung Nguyen, Eugene Kharitonov, F\\'elix de Chaumont Quitry, Hannah Muckenhirn, James Qin, Jiahui Yu, Johan Schalkwyk, Lukas Zilka, Marco Tagliasacchi, Matt Sharifi, Michelle Tadmor Ramanovich, Mihajlo Velimirovi\\'c, Neil Zeghidour, Paul K. Rubenstein, Peter Chen, Tara Sainath, Vicky Zayats, Wei Han, Yongqiang Wang, Yu Zhang, Zal\\'an Borsos, Zhishuai Zhang","submitted_at":"2023-06-22T14:37:54Z","abstract_excerpt":"We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demons"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That initializing the multimodal model with text-only LLM weights successfully transfers linguistic knowledge to speech tasks without degrading paralinguistic capabilities inherited from the speech model.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Fusing a text language model with a speech model and initializing from text weights produces a system that processes and generates both modalities while outperforming prior speech translation systems.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"7888c1339eebf57b00fbf1ce7aa3978acdc5ffc783e49808622b74a712314bf0"},"source":{"id":"2306.12925","kind":"arxiv","version":1},"verdict":{"id":"7670d4cf-f54b-4868-bece-9f81dcfc05bb","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T07:03:23.799538Z","strongest_claim":"The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training.","one_line_summary":"AudioPaLM unifies PaLM-2 and AudioLM to outperform prior systems on speech translation while enabling zero-shot speech-to-text for many unseen language pairs and voice transfer from short prompts.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That initializing the multimodal model with text-only LLM weights successfully transfers linguistic knowledge to speech tasks without degrading paralinguistic capabilities inherited from the speech model.","pith_extraction_headline":"Fusing a text language model with a speech model and initializing from text weights produces a system that processes and generates both modalities while outperforming prior speech translation systems."},"references":{"count":41,"sample":[{"doi":"","year":null,"title":"MusicLM: Generating Music From Text","work_id":"15e6566e-1c36-468f-966e-823248cbf87f","ref_index":1,"cited_arxiv_id":"2301.11325","is_internal_anchor":true},{"doi":"","year":null,"title":"PaLM 2 Technical Report","work_id":"905ee9a7-ea61-4a94-bd62-2600cbe3e315","ref_index":2,"cited_arxiv_id":"2305.10403","is_internal_anchor":true},{"doi":"","year":2020,"title":"ISBN 979-10-95546-34-4","work_id":"21667f47-f003-4382-b396-17b57c530dd7","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"mSLAM: Massively multilingual joint pre-training for speech and text","work_id":"c143c52c-596d-4470-b0f4-abac681bd925","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"L. Barrault, O. Bojar, M. R. Costa-jussà, C. Federmann, M. Fishel, Y . Graham, B. Haddow, M. Huck, P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal, M. Post, and M. Zampieri. Findings of the 2019 conf","work_id":"0c91c9d2-37ba-46d7-b2b3-d74c9b276fd3","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":41,"snapshot_sha256":"0ff9b9a65c23f886b56b16c49d9f7962648b154f4248d33ff4705ac2813f17be","internal_anchors":10},"formal_canon":{"evidence_count":2,"snapshot_sha256":"527b2d1c735814595e76ae9d9cc367913104b5e6ce7bee41f254d58906f5cafa"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2306.12925","created_at":"2026-05-17T23:38:48.734859+00:00"},{"alias_kind":"arxiv_version","alias_value":"2306.12925v1","created_at":"2026-05-17T23:38:48.734859+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2306.12925","created_at":"2026-05-17T23:38:48.734859+00:00"},{"alias_kind":"pith_short_12","alias_value":"NBCWDCLWXUDO","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"NBCWDCLWXUDO6HQY","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"NBCWDCLW","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":34,"internal_anchor_count":34,"sample":[{"citing_arxiv_id":"2406.14294","citing_title":"DASB - Discrete Audio and Speech Benchmark","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2409.18512","citing_title":"Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22170","citing_title":"Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2509.03526","citing_title":"Enhancing Speech Large Language Models through Reinforced Behavior Alignment","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20266","citing_title":"A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20946","citing_title":"Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20588","citing_title":"Direct Translation between Sign Languages","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17583","citing_title":"AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech","ref_index":66,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19101","citing_title":"Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2507.23511","citing_title":"MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2510.03093","citing_title":"Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2310.13289","citing_title":"SALMONN: Towards Generic Hearing Abilities for Large Language Models","ref_index":106,"is_internal_anchor":true},{"citing_arxiv_id":"2511.21517","citing_title":"Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2512.10931","citing_title":"Asynchronous Reasoning: Training-Free Interactive Thinking LLMs","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2512.16378","citing_title":"Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs","ref_index":80,"is_internal_anchor":true},{"citing_arxiv_id":"2602.01249","citing_title":"Generative AI in Signal Processing Education: An Audio Foundation Model Based Approach","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2507.16632","citing_title":"Step-Audio 2 Technical Report","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2306.13549","citing_title":"A Survey on Multimodal Large Language Models","ref_index":152,"is_internal_anchor":true},{"citing_arxiv_id":"2312.14125","citing_title":"VideoPoet: A Large Language Model for Zero-Shot Video Generation","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2503.12605","citing_title":"Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey","ref_index":207,"is_internal_anchor":true},{"citing_arxiv_id":"2603.05094","citing_title":"TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2603.15045","citing_title":"LLMs and Speech: Integration vs. Combination","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14340","citing_title":"Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2310.05737","citing_title":"Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation","ref_index":216,"is_internal_anchor":true},{"citing_arxiv_id":"2410.00037","citing_title":"Moshi: a speech-text foundation model for real-time dialogue","ref_index":79,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/NBCWDCLWXUDO6HQY7HIHSP6I3Q","json":"https://pith.science/pith/NBCWDCLWXUDO6HQY7HIHSP6I3Q.json","graph_json":"https://pith.science/api/pith-number/NBCWDCLWXUDO6HQY7HIHSP6I3Q/graph.json","events_json":"https://pith.science/api/pith-number/NBCWDCLWXUDO6HQY7HIHSP6I3Q/events.json","paper":"https://pith.science/paper/NBCWDCLW"},"agent_actions":{"view_html":"https://pith.science/pith/NBCWDCLWXUDO6HQY7HIHSP6I3Q","download_json":"https://pith.science/pith/NBCWDCLWXUDO6HQY7HIHSP6I3Q.json","view_paper":"https://pith.science/paper/NBCWDCLW","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2306.12925&json=true","fetch_graph":"https://pith.science/api/pith-number/NBCWDCLWXUDO6HQY7HIHSP6I3Q/graph.json","fetch_events":"https://pith.science/api/pith-number/NBCWDCLWXUDO6HQY7HIHSP6I3Q/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/NBCWDCLWXUDO6HQY7HIHSP6I3Q/action/timestamp_anchor","attest_storage":"https://pith.science/pith/NBCWDCLWXUDO6HQY7HIHSP6I3Q/action/storage_attestation","attest_author":"https://pith.science/pith/NBCWDCLWXUDO6HQY7HIHSP6I3Q/action/author_attestation","sign_citation":"https://pith.science/pith/NBCWDCLWXUDO6HQY7HIHSP6I3Q/action/citation_signature","submit_replication":"https://pith.science/pith/NBCWDCLWXUDO6HQY7HIHSP6I3Q/action/replication_record"}},"created_at":"2026-05-17T23:38:48.734859+00:00","updated_at":"2026-05-17T23:38:48.734859+00:00"}