AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
5 Pith papers cite this work.
citing papers explorer
- AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
  AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
- LightAVSeg: Lightweight Audio-Visual Segmentation
  LightAVSeg decouples semantic filtering and spatial grounding to achieve linear-cost cross-modal interaction in audio-visual segmentation, reaching 50.4 mIoU on MS3 with 20.5M parameters as a new lightweight state-of-the-art.
- Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
  Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.
- LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
  LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
- Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training
  GST uses gradient-based affinity metrics to form dataset groups and applies progressive scheduling, achieving 30-40% faster convergence than uniform mixture training on 14 AudioQA datasets while matching or exceeding performance.