Introduces DSFA to turn deterministic audio features stochastic during fine-tuning and the CoSG ExtEval dataset, claiming SOTA generalization for CodecFake detection.
Felix Wu, Kwangyoun Kim, Shinji Watanabe, Kyu J
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 8roles
background 2polarities
background 2representative citing papers
PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.
PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.
Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.
Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.
MoVE uses specialized LoRA expert adapters and a soft router to translate non-verbal vocalizations in S2ST, reproducing them in 76% of cases versus at most 14% for baselines while scoring highest on naturalness and emotional fidelity.
Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.
A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.
citing papers explorer
-
Game-Time: Evaluating Temporal Dynamics in Spoken Language Models
Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.
-
On The Landscape of Spoken Language Models: A Comprehensive Survey
A literature survey that organizes spoken language models by architecture, training, and evaluation choices and identifies key challenges and future directions.