FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022
16 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
method 1
polarities
use method 1
representative citing papers
SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show that no model excels across all dimensions and that compositional editing is especially difficult.
Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.
HiCoDiT generates speech from video by conditioning low-level RVQ tokens on speaker identity and high-level tokens on facial expressions via a dual-scale normalized diffusion transformer.
UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.
SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
Replacing early-reflected speech with time-shifted anechoic clean speech as the training target, combined with a two-stage distortion-perception framework, yields state-of-the-art universal speech enhancement.
Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.
ImmersiveTTS proposes an environment-aware TTS system that integrates speech with environmental audio via multimodal diffusion transformer, joint attention, and domain-specific representation alignment, claiming superior naturalness and fidelity.
Random phoneme substitutions recover most ASR gains from synthetic accented speech, with targeted edits and ground-truth prosody providing only marginal additional benefits.
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours of multilingual data.
OLIVE is a new self-supervised speech representation framework that unifies view-augmented masked latent prediction with waveform reconstruction under one objective.
F5-TTS-DPS integrates EMA and dual-scoring prompt selection into F5-TTS to produce in-the-wild TTS that achieves the best a-DCF scores (0.1582, 0.5233, 0.2562) on three SASV systems in the WildSpoof challenge.
citing papers explorer
SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning