pith. machine review for the scientific record. sign in

arxiv: 2605.11098 · v1 · submitted 2026-05-11 · 💻 cs.SD

Recognition: 2 theorem links

· Lean Theorem

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

Authors on Pith no claims yet

Pith reviewed 2026-05-13 01:01 UTC · model grok-4.3

classification 💻 cs.SD
keywords neural speech codecemotion preservationexpressive speechspeech compressionlatent modulationknowledge distillationsemantic alignmentspeech modeling
0
0 comments X

The pith

A neural speech codec preserves emotional information in compressed speech by guiding its latent representations with semantic and emotional cues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to create discrete speech representations that keep emotional expressiveness intact for use in speech language models. Existing codecs focus on acoustic reconstruction and lose these cues during quantization, limiting downstream tasks like expressive text-to-speech. The approach introduces three components that adjust and align the latent features to prioritize emotionally salient information alongside content and prosody. If the claim holds, compressed speech tokens would support more natural and emotionally consistent modeling without extra emotion-specific processing. Evaluations on reconstruction quality, emotion recognition accuracy, and generated speech confirm gains in consistency while holding semantic fidelity steady.

Core claim

We propose an emotion-guided neural speech codec that explicitly preserves emotional information while maintaining semantic fidelity and prosodic naturalness. Our framework combines emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment to retain emotionally salient cues under compression. Extensive evaluations across speech reconstruction, emotion recognition, and downstream text-to-speech generation demonstrate improved emotion consistency and perceptual quality without sacrificing content accuracy.

What carries the argument

Emotion-guided neural speech codec framework that uses emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment to retain emotional cues during quantization.

If this is right

  • Speech reconstruction achieves higher emotion consistency scores.
  • Emotion recognition models perform better on the discrete tokens produced by the codec.
  • Text-to-speech systems built on the codec outputs generate more expressive and natural speech.
  • Semantic content accuracy stays comparable to baseline codecs.
  • The discrete representations become more suitable for speech language models that require emotional context.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integration into larger speech models could reduce the need for separate emotion extraction stages in pipelines.
  • The guidance approach might extend to preserving other paralinguistic features such as speaker identity during compression.
  • Real-world voice interfaces could produce more affectively appropriate responses if the codec is adopted at scale.

Load-bearing premise

The three modules will retain emotional cues under compression without degrading semantic fidelity or prosodic naturalness.

What would settle it

Direct tests showing no improvement in emotion recognition accuracy on reconstructed samples from the new codec compared to standard neural codecs, with equivalent reconstruction quality.

Figures

Figures reproduced from arXiv: 2605.11098 by Hongfei Du, Jiacheng Shi, Xinyuan Song, Y. Alicia Hong, Yanfu Zhang, Ye Gao.

Figure 1
Figure 1. Figure 1: Overview of the proposed emotion-guided neural speech codec. The codec encodes input speech into [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Reconstruction subjective evaluation results across three complementary settings. (a) MUSHRA scores comparing Encodec, Llasa, our method, and ground-truth recordings, evaluating overall perceptual quality under a reference-based protocol. (b) MOS and Emotion-MOS results assessing naturalness and affective expressiveness across competing systems. (c) AB-preference results measuring pairwise perceptual prefe… view at source ↗
Figure 3
Figure 3. Figure 3: Text To Speech subjective evaluation results across two complementary settings. (a) MOS and Emotion￾MOS results assessing naturalness and affective expressiveness across competing systems. (b) AB-preference results measuring pairwise perceptual preference and emotional preference. sistency and perceptual quality. In contrast, the full Cross–Attn–Before configuration applies cross attention prior to project… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison of reconstructed spectrograms across different codecs. The figure visualizes mel-spectrograms of the same speech segment reconstructed by (a) the natural reference, (b) our method, (c) DAC, and (d) EnCodec. Low-frequency regions associated with prosodic and emotional cues are highlighted for comparison, illustrating differences in temporal continuity and spectral stability across mod… view at source ↗
read the original abstract

Neural speech codecs provide discrete representations for speech language models, but emotional cues are often degraded during quantization. Existing codecs mainly optimize acoustic reconstruction, leaving emotion expressiveness insufficiently modeled at the representation level. We propose an emotion-guided neural speech codec that explicitly preserves emotional information while maintaining semantic fidelity and prosodic naturalness. Our framework combines emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment to retain emotionally salient cues under compression. Extensive evaluations across speech reconstruction, emotion recognition, and downstream text-to-speech generation demonstrate improved emotion consistency and perceptual quality without sacrificing content accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces AffectCodec, an emotion-guided neural speech codec for preserving emotional cues in discrete representations used by speech language models. It proposes a framework with three components—emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment—to retain emotionally salient information under quantization while preserving semantic fidelity and prosodic naturalness. Evaluations across speech reconstruction, emotion recognition, and downstream TTS tasks are reported to show gains in emotion consistency and perceptual quality without content degradation.

Significance. If the empirical results hold, the work is significant because current neural codecs primarily optimize acoustic reconstruction and often degrade emotional expressiveness, limiting their utility for expressive speech modeling and SLMs. The explicit modeling of emotion-semantic relations at the latent level addresses a clear gap and could improve downstream tasks such as emotion-aware TTS. The multi-faceted evaluation (reconstruction, recognition, and generation) provides a reasonable test of the central claim.

minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from explicit citation of the specific datasets (e.g., IEMOCAP, ESD) and quantitative metrics (e.g., emotion recognition accuracy, MOS, WER) used in the three evaluation tracks to allow immediate assessment of the claimed improvements.
  2. [Section 3] Notation for the three proposed modules is introduced without a consolidated table or diagram showing their interactions with the base codec encoder/decoder; a single overview figure would improve readability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the recommendation for minor revision. The provided summary accurately captures the motivation, proposed framework, and evaluation strategy for AffectCodec. As no specific major comments were raised in the report, we have no individual points requiring rebuttal or clarification at this stage. We remain available to address any minor editorial or technical suggestions during the revision process.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes a new emotion-guided neural speech codec framework consisting of three modules (emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment). Claims of improved emotion consistency and perceptual quality rest on empirical evaluations across reconstruction, emotion recognition, and downstream TTS tasks rather than any derivation, prediction, or first-principles result that reduces to its own inputs by construction. No self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the provided abstract or description. The work is self-contained with independent empirical support.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

Only the abstract is available, so the ledger reflects the high-level framework components described; no numerical parameters, external axioms, or independent evidence for new entities are provided.

axioms (1)
  • standard math Standard neural network training assumptions for audio processing tasks
    Implicit in any deep learning model for speech coding and emotion modeling.
invented entities (3)
  • emotion-semantic guided latent modulation no independent evidence
    purpose: To incorporate emotional information into the latent representation
    Introduced as a core component of the proposed codec.
  • relation-preserving emotional-semantic distillation no independent evidence
    purpose: To maintain emotional relations during knowledge distillation
    New distillation technique proposed in the framework.
  • emotion-weighted semantic alignment no independent evidence
    purpose: To align semantic and emotional features with emotion-based weighting
    New alignment method introduced to retain emotional cues.

pith-pipeline@v0.9.0 · 5408 in / 1299 out tokens · 65681 ms · 2026-05-13T01:01:00.943721+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · 7 internal anchors

  1. [1]

    ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

    Learning arousal-valence representation from categorical emotion labels of speech , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

  2. [2]

    2015 IEEE international conference on acoustics, speech and signal processing (ICASSP) , year=

    Librispeech: an asr corpus based on public domain audio books , author=. 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP) , year=

  3. [3]

    LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

    Libritts: A corpus derived from librispeech for text-to-speech , author=. arXiv preprint arXiv:1904.02882 , year=

  4. [4]

    The Rainbow Passage which the speakers read out can be found in the International Dialects of English Archive:(http://web

    CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92) , author=. The Rainbow Passage which the speakers read out can be found in the International Dialects of English Archive:(http://web. ku. edu/\

  5. [5]

    arXiv preprint arXiv:2010.11567 , year=

    Aishell-3: A multi-speaker mandarin tts corpus and the baselines , author=. arXiv preprint arXiv:2010.11567 , year=

  6. [6]

    2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=

    Audio set: An ontology and human-labeled dataset for audio events , author=. 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages=. 2017 , organization=

  7. [7]

    IEEE Transactions on Affective Computing , volume=

    Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings , author=. IEEE Transactions on Affective Computing , volume=. 2017 , publisher=

  8. [8]

    Multimodal Language Analysis in the Wild: CMU - MOSEI Dataset and Interpretable Dynamic Fusion Graph

    Bagher Zadeh, AmirAli and Liang, Paul Pu and Poria, Soujanya and Cambria, Erik and Morency, Louis-Philippe. Multimodal Language Analysis in the Wild: CMU - MOSEI Dataset and Interpretable Dynamic Fusion Graph. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. doi:10.18653/v1/P18-1208

  9. [9]

    arXiv preprint arXiv:2402.13018 , year=

    EMO-SUPERB: An in-depth look at speech emotion recognition , author=. arXiv preprint arXiv:2402.13018 , year=

  10. [10]

    Proceedings of the 33rd ACM International Conference on Multimedia , pages=

    Emovoice: Llm-based emotional text-to-speech model with freestyle text prompting , author=. Proceedings of the 33rd ACM International Conference on Multimedia , pages=

  11. [11]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Secap: Speech emotion captioning with large language model , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  12. [12]

    High Fidelity Neural Audio Compression

    High fidelity neural audio compression , author=. arXiv preprint arXiv:2210.13438 , year=

  13. [13]

    Advances in Neural Information Processing Systems , volume=

    High-fidelity audio compression with improved rvqgan , author=. Advances in Neural Information Processing Systems , volume=

  14. [14]

    arXiv preprint arXiv:2403.03100 , year=

    Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models , author=. arXiv preprint arXiv:2403.03100 , year=

  15. [15]

    Speechtokenizer: Unified speech tokenizer for speech language models , author=. Proc. Int. Conf. Learn. Representations , pages=

  16. [16]

    Moshi: a speech-text foundation model for real-time dialogue

    Moshi: a speech-text foundation model for real-time dialogue , author=. arXiv preprint arXiv:2410.00037 , year=

  17. [17]

    BigCodec: Pushing the limits of low-bitrate neural speech codec.arXiv preprint arXiv:2409.05377,

    Bigcodec: Pushing the limits of low-bitrate neural speech codec , author=. arXiv preprint arXiv:2409.05377 , year=

  18. [18]

    Scaling transformers for low-bitrate high-quality speech coding.arXiv preprint arXiv:2411.19842,

    Scaling transformers for low-bitrate high-quality speech coding , author=. arXiv preprint arXiv:2411.19842 , year=

  19. [19]

    Llasa: Scaling train-time and inference- time compute for llama-based speech synthesis,

    Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis , author=. arXiv preprint arXiv:2502.04128 , year=

  20. [20]

    Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello

    Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling , author=. arXiv preprint arXiv:2408.16532 , year=

  21. [21]

    , author=

    From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition. , author=. Interspeech , pages=

  22. [22]

    arXiv preprint arXiv:2312.05187 , year=

    Seamless: Multilingual Expressive and Streaming Speech Translation , author=. arXiv preprint arXiv:2312.05187 , year=

  23. [23]

    Findings of the Association for Computational Linguistics: ACL 2024 , pages=

    emotion2vec: Self-supervised pre-training for speech emotion representation , author=. Findings of the Association for Computational Linguistics: ACL 2024 , pages=

  24. [24]

    2024 IEEE Spoken Language Technology Workshop (SLT) , pages=

    Laugh now cry later: Controlling time-varying emotional states of flow-matching-based zero-shot text-to-speech , author=. 2024 IEEE Spoken Language Technology Workshop (SLT) , pages=. 2024 , organization=

  25. [25]

    Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

    Neural codec language models are zero-shot text to speech synthesizers , author=. arXiv preprint arXiv:2301.02111 , year=

  26. [26]

    arXiv preprint arXiv:2402.09378 , year=

    Mobilespeech: A fast and high-fidelity framework for mobile zero-shot text-to-speech , author=. arXiv preprint arXiv:2402.09378 , year=

  27. [27]

    Qwen2.5-Omni Technical Report

    Qwen2. 5-omni technical report , author=. arXiv preprint arXiv:2503.20215 , year=

  28. [28]

    Kimi-Audio Technical Report

    Kimi-audio technical report , author=. arXiv preprint arXiv:2504.18425 , year=

  29. [29]

    Glm-4-voice: Towards intelli- gent and human-like end-to-end spoken chatbot.arXiv preprint arXiv:2412.02612,

    Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot , author=. arXiv preprint arXiv:2412.02612 , year=

  30. [30]

    Advances in Neural Information Processing Systems , volume=

    Uniaudio 1.5: Large language model-driven audio codec is a few-shot audio task learner , author=. Advances in Neural Information Processing Systems , volume=

  31. [31]

    IEEE/ACM Transactions on Audio, Speech, and Language Processing , volume=

    Soundstream: An end-to-end neural audio codec , author=. IEEE/ACM Transactions on Audio, Speech, and Language Processing , volume=. 2021 , publisher=

  32. [32]

    2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) , pages=

    Emo-codec: An in-depth look at emotion preservation capacity of legacy and neural codec models with subjective and objective evaluations , author=. 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) , pages=. 2024 , organization=

  33. [33]

    Findings of the Association for Computational Linguistics: ACL 2024 , pages=

    Codec-SUPERB: An in-depth analysis of sound codec models , author=. Findings of the Association for Computational Linguistics: ACL 2024 , pages=

  34. [34]

    IEEE/ACM transactions on audio, speech, and language processing , volume=

    Audiolm: a language modeling approach to audio generation , author=. IEEE/ACM transactions on audio, speech, and language processing , volume=. 2023 , publisher=

  35. [35]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Codec does matter: Exploring the semantic shortcoming of codec for audio language model , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  36. [36]

    Proceedings of the AAAI conference on artificial intelligence , volume=

    Distilvpr: Cross-modal knowledge distillation for visual place recognition , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

  37. [37]

    arXiv preprint arXiv:2509.09791 , year=

    The msp-podcast corpus , author=. arXiv preprint arXiv:2509.09791 , year=

  38. [38]

    Advances in neural information processing systems , volume=

    wav2vec 2.0: A framework for self-supervised learning of speech representations , author=. Advances in neural information processing systems , volume=

  39. [39]

    Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

  40. [40]

    IEEE/ACM transactions on audio, speech, and language processing , volume=

    Hubert: Self-supervised speech representation learning by masked prediction of hidden units , author=. IEEE/ACM transactions on audio, speech, and language processing , volume=. 2021 , publisher=

  41. [41]

    IEEE Journal of Selected Topics in Signal Processing , volume=

    Wavlm: Large-scale self-supervised pre-training for full stack speech processing , author=. IEEE Journal of Selected Topics in Signal Processing , volume=. 2022 , publisher=

  42. [42]

    Journal of the Audio Engineering Society , volume=

    Perceptual Evaluation of Speech Quality (PESQ) The New ITU Standard for End-to-End Speech Quality Assessment Part I--Time-Delay Compensation , author=. Journal of the Audio Engineering Society , volume=. 2002 , publisher=

  43. [43]

    Language resources and evaluation , volume=

    IEMOCAP: Interactive emotional dyadic motion capture database , author=. Language resources and evaluation , volume=. 2008 , publisher=

  44. [44]

    IEEE transactions on affective computing , volume=

    Crema-d: Crowd-sourced emotional multimodal actors dataset , author=. IEEE transactions on affective computing , volume=. 2014 , publisher=

  45. [45]

    IEEE Transactions on Affective Computing , volume=

    MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception , author=. IEEE Transactions on Affective Computing , volume=. 2016 , publisher=

  46. [46]

    2017 Seventh international conference on affective computing and intelligent interaction (ACII) , pages=

    NNIME: The NTHU-NTUA Chinese interactive multimodal emotion corpus , author=. 2017 Seventh international conference on affective computing and intelligent interaction (ACII) , pages=. 2017 , organization=

  47. [47]

    2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII) , pages=

    An intelligent infrastructure toward large scale naturalistic affective speech corpora collection , author=. 2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII) , pages=. 2023 , organization=

  48. [48]

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    Cosyvoice 2: Scalable streaming speech synthesis with large language models , author=. arXiv preprint arXiv:2412.10117 , year=

  49. [49]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  50. [50]

    Maskgct: Zero-shot text-to-speech with masked generative codec transformer

    Maskgct: Zero-shot text-to-speech with masked generative codec transformer , author=. arXiv preprint arXiv:2409.00750 , year=

  51. [51]

    Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications.arXiv preprint arXiv:2409.03283, 2024

    Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications , author=. arXiv preprint arXiv:2409.03283 , year=

  52. [52]

    Spark-tts: An efficient llm-based text-to- speech model with single-stream decoupled speech tokens.arXiv preprint arXiv:2503.01710, 2025

    Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens , author=. arXiv preprint arXiv:2503.01710 , year=

  53. [53]

    ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

    Emo-dpo: Controllable emotional speech synthesis through direct preference optimization , author=. ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2025 , organization=

  54. [54]

    Russian Emotional Speech Dialogs with annotated text

    Lubenets, Ilya and Davidchuk, Nikita and Amentes, Artem , license =. Russian Emotional Speech Dialogs with annotated text. , year=

  55. [55]

    Proceedings of the ninth international conference on language resources and evaluation (LREC'14) , pages=

    EMOVO corpus: an Italian emotional speech database , author=. Proceedings of the ninth international conference on language resources and evaluation (LREC'14) , pages=. 2014 , organization=

  56. [56]

    2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) , pages=

    The Mexican Emotional Speech Database (MESD): elaboration and assessment based on machine learning , author=. 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) , pages=. 2021 , organization=

  57. [57]

    Language Resources and Evaluation , volume=

    ShEMO: a large-scale validated database for Persian speech emotion detection , author=. Language Resources and Evaluation , volume=. 2019 , publisher=

  58. [58]

    Advances in neural information processing systems , volume=

    Neural discrete representation learning , author=. Advances in neural information processing systems , volume=

  59. [59]

    arXiv preprint arXiv:2305.02765 , year=

    Hifi-codec: Group-residual vector quantization for high fidelity audio codec , author=. arXiv preprint arXiv:2305.02765 , year=

  60. [60]

    arXiv preprint arXiv:2507.04048 , year=

    Clep-dg: Contrastive learning for speech emotion domain generalization via soft prompt tuning , author=. arXiv preprint arXiv:2507.04048 , year=

  61. [61]

    arXiv preprint arXiv:2406.03872 , year=

    Blsp-emo: Towards empathetic large speech-language models , author=. arXiv preprint arXiv:2406.03872 , year=

  62. [62]

    The interspeech 2009 emotion challenge , author=

  63. [63]

    Speech communication , volume=

    Emotional speech recognition: Resources, features, and methods , author=. Speech communication , volume=. 2006 , publisher=

  64. [64]

    2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) , year=

    End-to-end emotional speech synthesis using style tokens and semi-supervised training , author=. 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) , year=

  65. [65]

    international conference on machine learning , pages=

    Towards end-to-end prosody transfer for expressive speech synthesis with tacotron , author=. international conference on machine learning , pages=. 2018 , organization=

  66. [66]

    EURASIP Journal on Audio, Speech, and Music Processing , volume=

    ViSQOL: an objective speech quality model , author=. EURASIP Journal on Audio, Speech, and Music Processing , volume=. 2015 , publisher=

  67. [67]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Estimating or propagating gradients through stochastic neurons for conditional computation , author=. arXiv preprint arXiv:1308.3432 , year=

  68. [68]

    Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

    Fast and accurate deep network learning by exponential linear units (elus) , author=. arXiv preprint arXiv:1511.07289 , volume=

  69. [69]

    Advances in neural information processing systems , volume=

    Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis , author=. Advances in neural information processing systems , volume=

  70. [70]

    Digital Signal Processing , volume=

    Analysis of constant-Q filterbank based representations for speech emotion recognition , author=. Digital Signal Processing , volume=. 2022 , publisher=

  71. [71]

    2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) , pages=

    Emotion recognition from speech signals by Mel-spectrogram and a CNN-RNN , author=. 2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) , pages=. 2024 , organization=

  72. [72]

    ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation , author=. ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2023 , organization=

  73. [73]

    ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

    Clap learning audio concepts from natural language supervision , author=. ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2023 , organization=

  74. [74]

    ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

    Audioclip: Extending clip to image, text and audio , author=. ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2022 , organization=

  75. [75]

    ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , year=

    Emodiff: Intensity controllable emotional text-to-speech with soft-label guidance , author=. ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , year=

  76. [76]

    arXiv preprint arXiv:2509.25416 , year=

    Emotion-Aligned Generation in Diffusion Text to Speech Models via Preference-Guided Optimization , author=. arXiv preprint arXiv:2509.25416 , year=

  77. [77]

    ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

    Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

  78. [78]

    ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

    Audiodec: An open-source streaming high-fidelity neural audio codec , author=. ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2023 , organization=

  79. [79]

    ISO-MPEG-1 audio: A generic standard for coding of high-quality digital audio , author=. J. Audio Eng. Soc. , volume=

  80. [80]

    arXiv preprint arXiv:1602.04845 , year=

    High-quality, low-delay music coding in the opus codec , author=. arXiv preprint arXiv:1602.04845 , year=

Showing first 80 references.