{"paper":{"title":"Audio-Image Cross-Modal Retrieval with Onomatopoeic Images","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Training modality-specific projection heads on paired onomatopoeic data enables bidirectional audio-image retrieval.","cross_cats":[],"primary_cat":"eess.AS","authors_text":"Keisuke Imoto, Takao Tsuchiya, Yamato Kojima","submitted_at":"2026-05-17T15:42:41Z","abstract_excerpt":"Finding sound effects or environmental sounds that match a creator's intended impression remains a largely manual process in multimedia production. This is especially relevant for comics and other visual media, where visually stylized onomatopoeic expressions convey auditory impressions through letter shapes, strokes, layouts, and decorative patterns. However, cross-modal retrieval between onomatopoeic images and general sounds has been largely unexplored. This paper thus introduces a bidirectional retrieval framework between onomatopoeic images and the corresponding sound clips. 
Instead of di"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Experimental results show that the proposed method substantially outperforms a zero-shot baseline using pretrained CLIP and CLAP embeddings.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That training modality-specific projection heads on the MIAO dataset will produce embeddings that generalize to unseen onomatopoeic images and sounds outside the 50 classes.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Introduces a cross-modal retrieval framework using modality-specific projection heads on CLIP and CLAP embeddings together with the new MIAO dataset of 50 sound event classes for onomatopoeic image-sound pairs.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Training modality-specific projection heads on paired onomatopoeic data enables bidirectional audio-image retrieval.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"6626763fd4712f8df02cf94e9653fd2c6d60f7d870aedd63ae9b950d2fccb1d1"},"source":{"id":"2605.17509","kind":"arxiv","version":1},"verdict":{"id":"562d28de-fabc-4610-a09a-b6c3d5273350","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T22:26:44.686364Z","strongest_claim":"Experimental results show that the proposed method substantially outperforms a zero-shot baseline using pretrained CLIP and CLAP embeddings.","one_line_summary":"Introduces a cross-modal retrieval framework using modality-specific projection heads on CLIP and CLAP embeddings together with the new MIAO dataset of 50 sound event classes for onomatopoeic image-sound 
pairs.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That training modality-specific projection heads on the MIAO dataset will produce embeddings that generalize to unseen onomatopoeic images and sounds outside the 50 classes.","pith_extraction_headline":"Training modality-specific projection heads on paired onomatopoeic data enables bidirectional audio-image retrieval."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.17509/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"doi_title_agreement","ran_at":"2026-05-19T23:01:19.520826Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T22:41:16.881101Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"claim_evidence","ran_at":"2026-05-19T21:41:57.657160Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T21:33:23.631609Z","status":"skipped","version":"1.0.0","findings_count":0}],"snapshot_sha256":"0e717cd2f1e2d723d64bad681fb4dc98d04cc15bf487bd1a10f698a3e28016ca"},"references":{"count":12,"sample":[{"doi":"","year":2021,"title":"Learning transferable visual models from natural language supervision,","work_id":"b49ab322-7112-4324-a3b8-3aee70a9cd3a","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,","work_id":"78e8b04c-2a48-4db4-aac4-ee3121dc73a1","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"AudioCLIP: Extending clip to image, text and audio,","work_id":"3a3195df-1f20-4834-87c9-b72129c82052","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Wav2CLIP: Learning robust audio representations from 
clip,","work_id":"5cdf3b5f-5cbe-4dfe-a6c2-b5a35ec58fce","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"ImageBind: One embedding space to bind them all,","work_id":"e79f54ee-89fb-4220-bc2a-0cdebbf413e9","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":12,"snapshot_sha256":"a288ab65a1d3387892ac9536493c841878a50a8cc2cdcdd10cdc71475f9e101f","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"0d64fe87fb49b55c9f05db73324aa25eb0025f11b9dc9c370520ad52e214a582"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}