Correct"

· 2025 · arXiv 2506.21875

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

background 1 dataset 1

citation-polarity summary

background 1 use dataset 1

representative citing papers

MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.

Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

cs.CL · 2026-04-14 · conditional · novelty 7.0

Unified Audio Schema adds structured paralinguistic and event labels to audio training data, raising fine-grained perception scores by 10.9% on MMSU while keeping reasoning intact.

Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

eess.AS · 2025-09-30 · unverdicted · novelty 7.0

Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

cs.CL · 2025-09-26 · unverdicted · novelty 6.0

StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.

DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action

eess.AS · 2026-05-20 · unverdicted · novelty 5.0

DuplexSLA is a dual-stream three-channel full-duplex model that synchronizes continuous user audio, discrete assistant audio, and rate-limited action text for native turn-taking and in-conversation tool calling.

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

cs.SD · 2026-05-18 · unverdicted · novelty 5.0

A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.

Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency

eess.AS · 2026-04-06 · unverdicted · novelty 4.0

Full-Duplex-Bench-v3 provides a dataset of real human audio with five disfluency types and chained API tasks to benchmark six voice agent systems, revealing GPT-Realtime leads in accuracy while cascaded pipelines suffer highest latency.

A Survey of Audio Reasoning in Multimodal Foundation Models

eess.AS · 2026-05-20 · unverdicted · novelty 2.0

A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.

citing papers explorer

Showing 8 of 8 citing papers.

MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes cs.CL · 2026-05-07 · unverdicted · none · ref 60
MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs cs.CL · 2026-04-14 · conditional · none · ref 4
Unified Audio Schema adds structured paralinguistic and event labels to audio training data, raising fine-grained perception scores by 10.9% on MMSU while keeping reasoning intact.
Game-Time: Evaluating Temporal Dynamics in Spoken Language Models eess.AS · 2025-09-30 · unverdicted · none · ref 42
Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs cs.CL · 2025-09-26 · unverdicted · none · ref 87
StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action eess.AS · 2026-05-20 · unverdicted · none · ref 33
DuplexSLA is a dual-stream three-channel full-duplex model that synchronizes continuous user audio, discrete assistant audio, and rate-limited action text for native turn-taking and in-conversation tool calling.
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook cs.SD · 2026-05-18 · unverdicted · none · ref 191
A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency eess.AS · 2026-04-06 · unverdicted · none · ref 3
Full-Duplex-Bench-v3 provides a dataset of real human audio with five disfluency types and chained API tasks to benchmark six voice agent systems, revealing GPT-Realtime leads in accuracy while cascaded pipelines suffer highest latency.
A Survey of Audio Reasoning in Multimodal Foundation Models eess.AS · 2026-05-20 · unverdicted · none · ref 128
A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.

Correct"

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer