Recognition: 2 theorem links
AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
Pith reviewed 2026-05-13 01:01 UTC · model grok-4.3
The pith
A neural speech codec preserves emotional information in compressed speech by guiding its latent representations with semantic and emotional cues.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose an emotion-guided neural speech codec that explicitly preserves emotional information while maintaining semantic fidelity and prosodic naturalness. Our framework combines emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment to retain emotionally salient cues under compression. Extensive evaluations across speech reconstruction, emotion recognition, and downstream text-to-speech generation demonstrate improved emotion consistency and perceptual quality without sacrificing content accuracy.
What carries the argument
Emotion-guided neural speech codec framework that uses emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment to retain emotional cues during quantization.
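The review does not spell out how the latent modulation works. As a hedged illustration only: "emotion-semantic guided latent modulation" is often realized as feature-wise conditioning (FiLM-style), where an emotion embedding predicts a per-channel scale and shift applied to the codec's latent frames before quantization. A minimal NumPy sketch under that assumption; all names and shapes here are hypothetical, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def film_modulate(latents, emo_emb, W_gamma, W_beta):
    """FiLM-style modulation: an emotion embedding predicts a per-channel
    scale (gamma) and shift (beta) applied to every latent frame.

    latents : (T, D) codec latent frames
    emo_emb : (E,)   utterance-level emotion embedding
    W_gamma, W_beta : (E, D) linear maps (hypothetical parameters)
    """
    gamma = 1.0 + emo_emb @ W_gamma   # scale around the identity
    beta = emo_emb @ W_beta           # per-channel shift
    return latents * gamma + beta     # broadcasts over the T frames

T, D, E = 50, 16, 8
latents = rng.standard_normal((T, D))
emo_emb = rng.standard_normal(E)
W_gamma = 0.01 * rng.standard_normal((E, D))
W_beta = 0.01 * rng.standard_normal((E, D))

modulated = film_modulate(latents, emo_emb, W_gamma, W_beta)
print(modulated.shape)  # (50, 16)
```

With a zero emotion embedding the modulation reduces to the identity, so the conditioning only perturbs the latents where emotional evidence is present.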
If this is right
- Speech reconstruction achieves higher emotion consistency scores.
- Emotion recognition models perform better on the discrete tokens produced by the codec.
- Text-to-speech systems built on the codec outputs generate more expressive and natural speech.
- Semantic content accuracy stays comparable to baseline codecs.
- The discrete representations become more suitable for speech language models that require emotional context.
Where Pith is reading between the lines
- Integration into larger speech models could reduce the need for separate emotion extraction stages in pipelines.
- The guidance approach might extend to preserving other paralinguistic features such as speaker identity during compression.
- Real-world voice interfaces could produce more affectively appropriate responses if the codec is adopted at scale.
Load-bearing premise
The three modules will retain emotional cues under compression without degrading semantic fidelity or prosodic naturalness.
What would settle it
Direct tests showing no improvement in emotion recognition accuracy on reconstructed samples from the new codec compared to standard neural codecs, with equivalent reconstruction quality.
original abstract
Neural speech codecs provide discrete representations for speech language models, but emotional cues are often degraded during quantization. Existing codecs mainly optimize acoustic reconstruction, leaving emotion expressiveness insufficiently modeled at the representation level. We propose an emotion-guided neural speech codec that explicitly preserves emotional information while maintaining semantic fidelity and prosodic naturalness. Our framework combines emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment to retain emotionally salient cues under compression. Extensive evaluations across speech reconstruction, emotion recognition, and downstream text-to-speech generation demonstrate improved emotion consistency and perceptual quality without sacrificing content accuracy.
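The degradation the abstract points to comes from the codec's discrete bottleneck: each latent frame is snapped to its nearest codebook vector, so prosodic variation finer than the codebook resolution is discarded. A minimal nearest-neighbor vector-quantization sketch (generic VQ, not the paper's codec):

```python
import numpy as np

rng = np.random.default_rng(1)

def vector_quantize(latents, codebook):
    """Snap each latent frame to its nearest codebook entry (L2 distance).

    latents  : (T, D) continuous latent frames
    codebook : (K, D) code vectors
    returns  : (indices, quantized) -- discrete tokens and their vectors
    """
    # Pairwise squared distances between frames and codes: shape (T, K)
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

T, D, K = 40, 8, 32
latents = rng.standard_normal((T, D))
codebook = rng.standard_normal((K, D))

idx, quantized = vector_quantize(latents, codebook)
mse = ((latents - quantized) ** 2).mean()  # information lost at the bottleneck
print(idx.shape, mse > 0)
```

The nonzero reconstruction error is exactly the budget within which emotionally salient detail can vanish; the paper's three modules aim to bias what survives it.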
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AffectCodec, an emotion-guided neural speech codec for preserving emotional cues in discrete representations used by speech language models. It proposes a framework with three components—emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment—to retain emotionally salient information under quantization while preserving semantic fidelity and prosodic naturalness. Evaluations across speech reconstruction, emotion recognition, and downstream TTS tasks are reported to show gains in emotion consistency and perceptual quality without content degradation.
Significance. If the empirical results hold, the work is significant because current neural codecs primarily optimize acoustic reconstruction and often degrade emotional expressiveness, limiting their utility for expressive speech modeling and SLMs. The explicit modeling of emotion-semantic relations at the latent level addresses a clear gap and could improve downstream tasks such as emotion-aware TTS. The multi-faceted evaluation (reconstruction, recognition, and generation) provides a reasonable test of the central claim.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from explicit citation of the specific datasets (e.g., IEMOCAP, ESD) and quantitative metrics (e.g., emotion recognition accuracy, MOS, WER) used in the three evaluation tracks to allow immediate assessment of the claimed improvements.
- [Section 3] Notation for the three proposed modules is introduced without a consolidated table or diagram showing their interactions with the base codec encoder/decoder; a single overview figure would improve readability.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and the recommendation for minor revision. The provided summary accurately captures the motivation, proposed framework, and evaluation strategy for AffectCodec. As no specific major comments were raised in the report, we have no individual points requiring rebuttal or clarification at this stage. We remain available to address any minor editorial or technical suggestions during the revision process.
Circularity Check
No significant circularity detected
full rationale
The paper proposes a new emotion-guided neural speech codec framework consisting of three modules (emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment). Claims of improved emotion consistency and perceptual quality rest on empirical evaluations across reconstruction, emotion recognition, and downstream TTS tasks rather than any derivation, prediction, or first-principles result that reduces to its own inputs by construction. No self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the provided abstract or description. The work is self-contained with independent empirical support.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math · Standard neural network training assumptions for audio processing tasks
invented entities (3)
- emotion-semantic guided latent modulation · no independent evidence
- relation-preserving emotional-semantic distillation · no independent evidence
- emotion-weighted semantic alignment · no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Matched passage: "Our framework combines emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Matched passage: L_rela = (1/T'^2) Σ_{t,t'} [ α·d(r^(1)_{t,t'}, r^emo_{t,t'}) + β·d(r^(1)_{t,t'}, r^sem_{t,t'}) ]
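Read literally, the quoted loss L_rela compares pairwise frame-relation matrices: r^(1) from the codec's first-quantizer features, r^emo and r^sem from emotion and semantic teachers, with a distance d averaged over all T'^2 frame pairs. A hedged NumPy sketch, assuming cosine-similarity relations and squared difference for d (the paper's choices are not stated here):

```python
import numpy as np

rng = np.random.default_rng(2)

def relation_matrix(feats, eps=1e-8):
    """Pairwise cosine-similarity relations r[t, t'] between frames."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / (norms + eps)
    return unit @ unit.T  # (T, T)

def relation_distill_loss(student, emo_teacher, sem_teacher, alpha=1.0, beta=1.0):
    """L_rela = (1/T'^2) * sum_{t,t'} [alpha*d(r1, r_emo) + beta*d(r1, r_sem)]
    with d taken as squared difference (an assumption)."""
    r1 = relation_matrix(student)
    r_emo = relation_matrix(emo_teacher)
    r_sem = relation_matrix(sem_teacher)
    # .mean() over the (T, T) matrices supplies the 1/T'^2 normalization
    return (alpha * (r1 - r_emo) ** 2 + beta * (r1 - r_sem) ** 2).mean()

T = 20
student = rng.standard_normal((T, 12))       # first-quantizer codec features
emo_teacher = rng.standard_normal((T, 32))   # emotion-model features (hypothetical)
sem_teacher = rng.standard_normal((T, 24))   # semantic-model features (hypothetical)

loss = relation_distill_loss(student, emo_teacher, sem_teacher)
print(loss >= 0.0)  # True
```

Because only the T×T relation matrices are compared, the teachers' feature dimensions need not match the student's, which is the usual appeal of relation-level distillation.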
What do these tags mean?
- matches · The paper's claim is directly supported by a theorem in the formal canon.
- supports · The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends · The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses · The paper appears to rely on the theorem as machinery.
- contradicts · The paper's claim conflicts with a theorem or certificate in the canon.
- unclear · Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.