{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:BT6II3QL27YCFVKH5JUSPUGTZP","short_pith_number":"pith:BT6II3QL","schema_version":"1.0","canonical_sha256":"0cfc846e0bd7f022d547ea6927d0d3cbfd89940e84fdef2238d94e6032e07f58","source":{"kind":"arxiv","id":"2506.04779","version":3},"attestation_state":"computed","paper":{"title":"MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"MMSU benchmark shows current SpeechLLMs have substantial room for improvement in fine-grained spoken language understanding and reasoning.","cross_cats":["cs.SD","eess.AS"],"primary_cat":"cs.CL","authors_text":"Dingdong Wang, Dongchao Yang, Helen Meng, Jincenzi Wu, Junan Li, Tianhua Zhang, Xueyuan Chen","submitted_at":"2025-06-05T09:09:36Z","abstract_excerpt":"Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":true},"canonical_record":{"source":{"id":"2506.04779","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2025-06-05T09:09:36Z","cross_cats_sorted":["cs.SD","eess.AS"],"title_canon_sha256":"34a512c230e4ac979ff8cefdeccb4c211d40636bd1a37df4ddf69434f28af1f3","abstract_canon_sha256":"481749de47ba8e9209a5209b8f90181c4c35f07e514039babb3b91f7d9116e68"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.516749Z","signature_b64":"WqgV8KRgAhCD07FA9l9zAWJQ0m4lL78whGU0IMWyZwHbed29tGFeh4jrC/o+WKGZ0Tf/nLnHETlQnQazg4KUDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"0cfc846e0bd7f022d547ea6927d0d3cbfd89940e84fdef2238d94e6032e07f58","last_reissued_at":"2026-05-17T23:38:13.515980Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.515980Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"MMSU benchmark shows current SpeechLLMs have substantial room for improvement in fine-grained spoken language understanding and reasoning.","cross_cats":["cs.SD","eess.AS"],"primary_cat":"cs.CL","authors_text":"Dingdong Wang, Dongchao Yang, Helen Meng, Jincenzi Wu, Junan Li, Tianhua Zhang, Xueyuan Chen","submitted_at":"2025-06-05T09:09:36Z","abstract_excerpt":"Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The 5,000 audio-question-answer triplets have been meticulously curated to fairly and comprehensively represent the targeted linguistic phenomena without introducing selection bias or annotation artifacts that would distort model comparisons.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MMSU is a new benchmark with 5,000 curated audio-QA pairs across 47 linguistically grounded tasks that reveals substantial limitations in existing SpeechLLMs for fine-grained spoken language understanding and reasoning.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"MMSU benchmark shows current SpeechLLMs have substantial room for improvement in fine-grained spoken language understanding and reasoning.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3f637408ee71faa0360be6441e48eb7f78d557260d0a65e837a766dbff650d78"},"source":{"id":"2506.04779","kind":"arxiv","version":3},"verdict":{"id":"82e7f7c0-b112-499e-80ee-1ee5f79d68bf","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T17:16:48.388929Z","strongest_claim":"Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization.","one_line_summary":"MMSU is a new benchmark with 5,000 curated audio-QA pairs across 47 linguistically grounded tasks that reveals substantial limitations in existing SpeechLLMs for fine-grained spoken language understanding and reasoning.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The 5,000 audio-question-answer triplets have been meticulously curated to fairly and comprehensively represent the targeted linguistic phenomena without introducing selection bias or annotation artifacts that would distort model comparisons.","pith_extraction_headline":"MMSU benchmark shows current SpeechLLMs have substantial room for improvement in fine-grained spoken language understanding and reasoning."},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"dca2635067f684e6bff1cf0a06edad606b39ef3ad0e6c1fdacc896b8f4b952db"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2506.04779","created_at":"2026-05-17T23:38:13.516099+00:00"},{"alias_kind":"arxiv_version","alias_value":"2506.04779v3","created_at":"2026-05-17T23:38:13.516099+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2506.04779","created_at":"2026-05-17T23:38:13.516099+00:00"},{"alias_kind":"pith_short_12","alias_value":"BT6II3QL27YC","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"BT6II3QL27YCFVKH","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"BT6II3QL","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":17,"internal_anchor_count":17,"sample":[{"citing_arxiv_id":"2603.17837","citing_title":"The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12034","citing_title":"Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2507.08128","citing_title":"Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models","ref_index":108,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12036","citing_title":"Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12034","citing_title":"Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25719","citing_title":"Step-Audio-R1.5 Technical Report","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25591","citing_title":"Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06631","citing_title":"Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23717","citing_title":"HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08209","citing_title":"OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07593","citing_title":"TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2509.17765","citing_title":"Qwen3-Omni Technical Report","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12527","citing_title":"Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14548","citing_title":"VoxSafeBench: Not Just What Is Said, but Who, How, and Where","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15804","citing_title":"Qwen3.5-Omni Technical Report","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16659","citing_title":"Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20842","citing_title":"SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation","ref_index":27,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/BT6II3QL27YCFVKH5JUSPUGTZP","json":"https://pith.science/pith/BT6II3QL27YCFVKH5JUSPUGTZP.json","graph_json":"https://pith.science/api/pith-number/BT6II3QL27YCFVKH5JUSPUGTZP/graph.json","events_json":"https://pith.science/api/pith-number/BT6II3QL27YCFVKH5JUSPUGTZP/events.json","paper":"https://pith.science/paper/BT6II3QL"},"agent_actions":{"view_html":"https://pith.science/pith/BT6II3QL27YCFVKH5JUSPUGTZP","download_json":"https://pith.science/pith/BT6II3QL27YCFVKH5JUSPUGTZP.json","view_paper":"https://pith.science/paper/BT6II3QL","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2506.04779&json=true","fetch_graph":"https://pith.science/api/pith-number/BT6II3QL27YCFVKH5JUSPUGTZP/graph.json","fetch_events":"https://pith.science/api/pith-number/BT6II3QL27YCFVKH5JUSPUGTZP/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/BT6II3QL27YCFVKH5JUSPUGTZP/action/timestamp_anchor","attest_storage":"https://pith.science/pith/BT6II3QL27YCFVKH5JUSPUGTZP/action/storage_attestation","attest_author":"https://pith.science/pith/BT6II3QL27YCFVKH5JUSPUGTZP/action/author_attestation","sign_citation":"https://pith.science/pith/BT6II3QL27YCFVKH5JUSPUGTZP/action/citation_signature","submit_replication":"https://pith.science/pith/BT6II3QL27YCFVKH5JUSPUGTZP/action/replication_record"}},"created_at":"2026-05-17T23:38:13.516099+00:00","updated_at":"2026-05-17T23:38:13.516099+00:00"}