{"paper":{"title":"Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A speech-text alignment method generates expressive pseudo-audio prompts for effective text-only domain adaptation in LLM-based ASR, outperforming prior text-only approaches on error rates and OOV coverage.","cross_cats":[],"primary_cat":"cs.SD","authors_text":"Ryo Magoshi, Takashi Maekaku, Yusuke Shinohara","submitted_at":"2026-05-14T04:04:03Z","abstract_excerpt":"LLM-based automatic speech recognition models demonstrate strong performance by connecting audio encoders and LLMs. However, data scarcity of paired speech and transcription often hinders their adaptation to new domains, making text-only domain adaptation crucial. Existing methods typically rely on either fine-tuning the LLM alone or employing pseudo-audio prompts. The former neglects essential acoustic context, while the latter either suffers from limited scalability in data-scarce conditions, or yields inexpressive prompts by leveraging only textual features, ignoring audio modality. To addr"},"claims":{"count":3,"items":[{"kind":"strongest_claim","text":"Our method efficiently generates highly expressive pseudo-audio prompts that bridges the modality gap, enabling effective target-domain adaptation. 
Experiments demonstrate that our approach outperforms existing text-only methods, improving both overall error rates and out-of-vocabulary coverage.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That explicitly modeling speech-text alignment during pseudo-audio prompt generation will produce prompts expressive enough to close the modality gap and yield measurable gains in target-domain ASR without any real audio from that domain.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A speech-text alignment method generates expressive pseudo-audio prompts for effective text-only domain adaptation in LLM-based ASR, outperforming prior text-only approaches on error rates and OOV coverage.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"}],"snapshot_sha256":"642631101d7d9ba0800688c48ee0510e343798d6ae250950e34454768f3e0e44"},"source":{"id":"2605.14340","kind":"arxiv","version":1},"verdict":{"id":"1fc87b6b-be6a-4c78-a1ab-c55d839ede3e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T02:18:41.378462Z","strongest_claim":"Our method efficiently generates highly expressive pseudo-audio prompts that bridges the modality gap, enabling effective target-domain adaptation. 
Experiments demonstrate that our approach outperforms existing text-only methods, improving both overall error rates and out-of-vocabulary coverage.","one_line_summary":"A speech-text alignment method generates expressive pseudo-audio prompts for effective text-only domain adaptation in LLM-based ASR, outperforming prior text-only approaches on error rates and OOV coverage.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That explicitly modeling speech-text alignment during pseudo-audio prompt generation will produce prompts expressive enough to close the modality gap and yield measurable gains in target-domain ASR without any real audio from that domain.","pith_extraction_headline":""},"references":{"count":38,"sample":[{"doi":"","year":null,"title":"As illustrated in Figure 1, these architectures typically input representations from a pre-trained audio encoder into a trainable projector","work_id":"bf81af42-c4c0-476a-b29e-0b07e29fe268","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"Refining Pseudo-Audio Prompts with Speech-Text Alignment for Text-Only Domain Adaptation in LLM-Based ASR","work_id":"b492e420-a958-4bff-8d21-c9e43d631906","ref_index":2,"cited_arxiv_id":"2605.14340","is_internal_anchor":true},{"doi":"","year":null,"title":"LLM-based ASR We follow an LLM-based ASR framework where the LLM is conditioned on an acoustic representation [1]","work_id":"9dab79a8-ae6a-4583-b634-d9ff31339b84","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Hours” denotes the total duration of paired audio-text data used for source training, and “#Samples","work_id":"20ce78fd-cb3d-4bbd-8c63-71386a27c210","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Unlike methods relying only on heuristic embedding manipulation, TE2SL employs a learnable Conformer-based refinement 
module","work_id":"d5341f08-7078-4637-956f-cd66bd50d979","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":38,"snapshot_sha256":"4dcb454b3b2088227fb0901e11c226a6cce9acc7b96edd505fe80e1c3c9321d9","internal_anchors":3},"formal_canon":{"evidence_count":2,"snapshot_sha256":"3c3c5860aed8a67d29b89e660151402aa3165da351cf1eee91201bc5c6c1b782"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}