FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
hub
Hifi-codec: Group-residual vector quantization for high fidelity audio codec
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
AffectCodec applies block-diagonal projections in residual FSQ to explicitly allocate bits to emotion and acoustic subspaces, combined with emotion conditioning, yielding better emotion preservation at low bitrates with competitive acoustic quality.
ClariCodec achieves 3.55% WER on LibriSpeech test-clean at 300 bps by RL fine-tuning the encoder for intelligibility, yielding a 23% relative WER reduction while preserving perceptual quality.
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
PairAlign learns compact variable-length token sequences for audio via self-alignment on paired content-preserving views, achieving 55% fewer archive tokens than VQ while preserving edit-distance retrieval at 12.71 tokens/s.
Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.
AdaSID adaptively regulates semantic ID overlaps in multimodal recommendations to improve retrieval performance, codebook utilization, and downstream metrics like GMV.
DASB is a new benchmark for discrete audio tokens showing semantic tokens outperform acoustic ones but discrete representations remain less robust than continuous features across domains.
Self-guidance adds a lightweight feature-mapping loss to align decoder manifolds in VQ-VAE speech codecs, raising reconstruction metrics and allowing 4x codebook reduction with no fidelity loss.
Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
SwitchCodec introduces Residual Experts Vector Quantization and a multi-tiered STFT discriminator to achieve PESQ 2.87 and ViSQOL 4.27 at 2.67 kbps while halving training time via post-training.
UniAudio-Token enhances semantic speech tokenizers with general audio perception using SAP decomposition and SAE gating, outperforming single-codebook baselines when integrated with LLMs on understanding and generation tasks.
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.
citing papers explorer
-
UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception
UniAudio-Token enhances semantic speech tokenizers with general audio perception using SAP decomposition and SAE gating, outperforming single-codebook baselines when integrated with LLMs on understanding and generation tasks.