MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
8 Pith papers cite this work.
citing papers explorer
- MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
  MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
- Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs
  Unified Audio Schema adds structured paralinguistic and event labels to audio training data, raising fine-grained perception scores by 10.9% on MMSU while keeping reasoning intact.
- Game-Time: Evaluating Temporal Dynamics in Spoken Language Models
  Game-Time Benchmark shows that spoken language models handle basic tasks but degrade sharply under temporal constraints such as tempo adherence and synchronized responses.
- StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
  StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.
- DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
  DuplexSLA is a dual-stream three-channel full-duplex model that synchronizes continuous user audio, discrete assistant audio, and rate-limited action text for native turn-taking and in-conversation tool calling.
- A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
  A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
- Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency
  Full-Duplex-Bench-v3 provides a dataset of real human audio with five disfluency types and chained API tasks to benchmark six voice agent systems, revealing that GPT-Realtime leads in accuracy while cascaded pipelines suffer the highest latency.
- A Survey of Audio Reasoning in Multimodal Foundation Models
  A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.