Semantic-Emotional Resonance Embedding: A Semi-Supervised Paradigm for Cross-Lingual Speech Emotion Recognition

Liejun Wang; Ya Zhao; Yinfeng Yu

arXiv: 2604.07417 · v1 · submitted 2026-04-08 · 💻 cs.SD · eess.AS

Ya Zhao, Yinfeng Yu, Liejun Wang

Pith reviewed 2026-05-10 17:14 UTC · model grok-4.3

classification 💻 cs.SD eess.AS

keywords cross-lingual speech emotion recognition · semi-supervised learning · semantic-emotional embedding · resonance field · interaction chain loss · 5-shot labeling · dynamic feature paradigm

The pith

A resonance embedding transfers speech emotion recognition across languages using only 5-shot labeling in the source language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a semi-supervised method that builds an emotion-semantic structure from a handful of labeled speech samples in one language. It then lets unlabeled samples from other languages organize themselves into that structure through an instantaneous resonance field, while a triple-resonance interaction chain loss strengthens connections during emotional peaks. This setup removes the need for any labels or translations in the target language. If correct, the approach would allow emotion recognition systems to reach useful performance in low-resource languages without the usual expensive data collection or alignment steps.

Core claim

The paper claims that Semantic-Emotional Resonance Embedding constructs an emotion-semantic structure from a small number of labeled source samples, uses an Instantaneous Resonance Field to let unlabeled target samples self-organize into the structure for semi-supervised guidance, and applies a Triple-Resonance Interaction Chain loss to reinforce interactions and embedding between labeled and unlabeled samples during emotional highlights, enabling effective cross-lingual speech emotion recognition without target labels or translation alignment.
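
The abstract does not define the IRF mathematically, so the following is only a generic stand-in for the idea of unlabeled samples organizing around a structure seeded by a few labels. Class prototypes are averaged from the labeled source embeddings, and each unlabeled target embedding receives a soft assignment by cosine similarity. The function names, the temperature value, the reading of "5-shot" as five samples per class, and the synthetic data are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def class_prototypes(features, labels, num_classes):
    """Mean embedding per emotion class from the few labeled source samples."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])

def soft_assign(features, prototypes, temperature=0.1):
    """Softmax over cosine similarity between each sample and each prototype."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = f @ p.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Synthetic stand-in data (hypothetical): 4 emotions, 5 labeled source samples each.
rng = np.random.default_rng(0)
num_classes, shots, dim = 4, 5, 16
centers = rng.normal(size=(num_classes, dim)) * 3.0
src_labels = np.repeat(np.arange(num_classes), shots)
src_feats = centers[src_labels] + rng.normal(size=(num_classes * shots, dim))

# Unlabeled "target-language" samples; true labels are kept only to build the toy data.
tgt_true = rng.integers(0, num_classes, size=200)
tgt_feats = centers[tgt_true] + rng.normal(size=(200, dim))

protos = class_prototypes(src_feats, src_labels, num_classes)
assign = soft_assign(tgt_feats, protos)   # shape (200, 4), rows sum to 1
pseudo_labels = assign.argmax(axis=1)     # hard pseudo-label per target sample
```

Note that the toy target features are drawn around the same class centers as the source features, which builds in the very language-invariance that a real cross-lingual system would have to earn.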

What carries the argument

Semantic-Emotional Resonance Embedding (SERE) with its Instantaneous Resonance Field that guides self-organization of unlabeled samples and Triple-Resonance Interaction Chain loss that reinforces labeled-unlabeled interactions.

If this is right

Only 5-shot labeling in the source language is needed to support the full framework.
No labels whatsoever are required for any target language.
No explicit translation or alignment between languages is necessary.
The method demonstrates gains across multiple languages in the reported experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The resonance mechanism could be tested on other sequential audio tasks such as speaker verification across languages.
Performance might degrade for language pairs with very different phonetic structures, providing a natural limit to check.
The self-organization idea could combine with existing unsupervised clustering methods to further reduce source labeling needs.

Load-bearing premise

Unlabeled speech from a new language will naturally form the right emotional groupings when shaped only by resonance fields derived from a few labeled examples in a different language.

What would settle it

An inspection of the learned embeddings showing that target-language samples do not cluster according to emotional categories when the resonance field component is removed.
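
One way to make such an inspection quantitative, sketched here under assumptions: given embeddings and held-out emotion labels used only for evaluation, measure how often a sample's nearest neighbours share its label. The helper `knn_label_agreement`, the choice of `k`, and the synthetic embeddings are hypothetical; the paper's own analysis may use a different procedure.

```python
import numpy as np

def knn_label_agreement(embeddings, labels, k=5):
    """Mean fraction of each sample's k nearest neighbours that share its label."""
    sq = np.sum(embeddings ** 2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * embeddings @ embeddings.T
    np.fill_diagonal(dists, np.inf)           # a sample is not its own neighbour
    nn = np.argsort(dists, axis=1)[:, :k]     # indices of the k closest samples
    return float(np.mean(labels[nn] == labels[:, None]))

# Synthetic comparison (hypothetical data): emotion-clustered vs. unstructured embeddings.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=120)
centers = rng.normal(size=(4, 8)) * 5.0
clustered = centers[labels] + rng.normal(size=(120, 8))
unstructured = rng.normal(size=(120, 8))

score_clustered = knn_label_agreement(clustered, labels)
score_unstructured = knn_label_agreement(unstructured, labels)
```

Running the same score on target-language embeddings from a model trained with and without the resonance field component would give a single number per variant to compare, alongside any visual inspection.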

Figures

Figures reproduced from arXiv: 2604.07417 by Liejun Wang, Ya Zhao, Yinfeng Yu.

**Figure 2.** Overview of the proposed SERE semi-supervised dual-path architecture for CLSER tasks.

**Figure 3.** Feature distribution of different SERE components under task C.

Abstract

Cross-lingual Speech Emotion Recognition (CLSER) aims to identify emotional states in unseen languages. However, existing methods heavily rely on the semantic synchrony of complete labels and static feature stability, hindering low-resource languages from reaching high-resource performance. To address this, we propose a semi-supervised framework based on Semantic-Emotional Resonance Embedding (SERE), a cross-lingual dynamic feature paradigm that requires neither target language labels nor translation alignment. Specifically, SERE constructs an emotion-semantic structure using a small number of labeled samples. It learns human emotional experiences through an Instantaneous Resonance Field (IRF), enabling unlabeled samples to self-organize into this structure. This achieves semi-supervised semantic guidance and structural discovery. Additionally, we design a Triple-Resonance Interaction Chain (TRIC) loss to enable the model to reinforce the interaction and embedding capabilities between labeled and unlabeled samples during emotional highlights. Extensive experiments across multiple languages demonstrate the effectiveness of our method, requiring only 5-shot labeling in the source language.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a semi-supervised cross-lingual SER approach that claims to work with only 5 source labels and no target supervision, but the IRF/TRIC self-organization step lacks any shown mechanism for language-invariant features.

The headline claim is that a small set of labeled source samples can seed an emotion-semantic structure, after which an Instantaneous Resonance Field lets unlabeled target-language speech self-organize into the right emotional categories, reinforced by a Triple-Resonance Interaction Chain loss. No target labels or translations are used at all. That framing is distinct from most prior CLSER work that leans on alignment or more supervision, and it directly targets the data scarcity problem in low-resource languages. The authors deserve credit for spelling out a concrete, low-shot alternative instead of just calling for more data collection. The idea of dynamic resonance between labeled and unlabeled samples during emotional highlights is at least a fresh way to think about semi-supervised transfer in audio. If the full experiments across languages show consistent gains over baselines, the practical payoff could be real for affective systems that need to cross language boundaries quickly.

The soft spot is exactly where the stress test points: nothing in the description supplies a shared embedding layer, cross-lingual contrastive term, or invariance argument that would make the resonance field reliably language-agnostic. If the underlying acoustic features remain language-specific, the unlabeled target samples have no guaranteed way to land in the correct emotion clusters built from source data alone. The abstract asserts effectiveness from experiments, yet supplies no equations, ablation tables, or error bars to let a reader judge whether the self-organization actually occurred or whether results were sensitive to the 5-shot choice. This leaves the central mechanism looking more like an internal fitting procedure than a demonstrated transfer principle.

The work is aimed at speech-processing and affective-computing groups that already experiment with semi-supervised or embedding-based transfer. A reader hunting for new low-supervision ideas could extract useful concepts even if the current implementation needs tightening. It is worth sending to peer review so that referees can check the actual derivations, training details, and result robustness rather than desk-rejecting an approach that at least tries to solve a stubborn practical bottleneck.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a semi-supervised framework called Semantic-Emotional Resonance Embedding (SERE) for cross-lingual speech emotion recognition. It builds an emotion-semantic structure from a small number (5-shot) of labeled source-language samples, then uses an Instantaneous Resonance Field (IRF) to let unlabeled target-language samples self-organize into that structure without target labels or translation alignment. A Triple-Resonance Interaction Chain (TRIC) loss is introduced to strengthen interactions between labeled and unlabeled samples at emotional highlights. Experiments across multiple languages are claimed to demonstrate the approach's effectiveness.

Significance. If the IRF and TRIC mechanisms can be shown to produce reliable language-invariant self-organization from source-only labels, the work would provide a practical route to high-performance CLSER in low-resource settings, substantially lowering the labeling burden for target languages.

major comments (2)

[Abstract] The central claim that IRF enables unlabeled target samples to self-organize correctly into a source-derived emotion-semantic structure is asserted without any equations, training details, ablation results, or error bars. This leaves the performance claims without visible derivation or empirical support.
[Method] (IRF and TRIC definitions) No explicit cross-lingual invariance mechanism, shared embedding layer, or contrastive term is supplied to guarantee that the resonance field extracts language-invariant emotional features from raw speech. If the underlying acoustic space remains language-specific, the interaction chain cannot produce reliable self-organization and the semi-supervised transfer collapses.
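
For concreteness, one widely used family of explicit invariance terms in domain adaptation penalizes a distribution discrepancy between source and target embeddings, for example the maximum mean discrepancy (MMD). The sketch below computes a biased squared-MMD estimate with an RBF kernel. It only illustrates the kind of term the second comment asks about and is not claimed to be part of SERE; the function names, the `gamma` value, and the toy data are assumptions.

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """RBF kernel matrix k(x_i, y_j) = exp(-gamma * ||x_i - y_j||^2)."""
    sq = np.sum(x ** 2, axis=1)[:, None] + np.sum(y ** 2, axis=1)[None, :] - 2.0 * x @ y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def mmd2(source, target, gamma=0.1):
    """Biased estimate of squared MMD between two sets of embeddings."""
    return float(
        rbf_kernel(source, source, gamma).mean()
        + rbf_kernel(target, target, gamma).mean()
        - 2.0 * rbf_kernel(source, target, gamma).mean()
    )

# Toy check (hypothetical data): matched distributions vs. a shifted "language-specific" one.
rng = np.random.default_rng(0)
source = rng.normal(size=(100, 8))
target_same = rng.normal(size=(100, 8))
target_shifted = rng.normal(size=(100, 8)) + 2.0

mmd_same = mmd2(source, target_same)
mmd_shifted = mmd2(source, target_shifted)
```

A small value on real source and target embeddings would be one piece of evidence that the learned space is not strongly language-specific; a large value would support the referee's concern.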

minor comments (1)

[Abstract] The abstract and introduction introduce multiple new acronyms (SERE, IRF, TRIC) without immediate reference to prior literature on resonance-based or interaction-chain models in speech processing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript proposing the SERE framework for cross-lingual speech emotion recognition. We address each major comment point by point below, indicating where revisions will be made to strengthen clarity and support for the claims.

Referee: [Abstract] The central claim that IRF enables unlabeled target samples to self-organize correctly into a source-derived emotion-semantic structure is asserted without any equations, training details, ablation results, or error bars. This leaves the performance claims without visible derivation or empirical support.

Authors: The abstract is a concise high-level summary by design and does not contain technical derivations. The full manuscript supplies the IRF and TRIC equations in Section 3, training details in Section 4, ablation studies in Section 5.3, and error bars on all reported results in the experimental figures. We will revise the abstract to include a brief reference to these mechanisms and their empirical validation to better support the central claim.
revision: partial

Referee: [Method] (IRF and TRIC definitions) No explicit cross-lingual invariance mechanism, shared embedding layer, or contrastive term is supplied to guarantee that the resonance field extracts language-invariant emotional features from raw speech. If the underlying acoustic space remains language-specific, the interaction chain cannot produce reliable self-organization and the semi-supervised transfer collapses.

Authors: We agree that an explicit statement of the invariance mechanism would improve the presentation. The IRF achieves language-invariance by operating on instantaneous resonance at emotional highlights rather than language-specific acoustics, and TRIC reinforces this via interaction chains between source-labeled and target-unlabeled samples. We will add a dedicated paragraph in the revised Method section to explicitly describe these invariance properties, including any shared components or interaction terms.
revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in claimed framework

The paper proposes an empirical semi-supervised method (SERE with IRF and TRIC loss) that constructs an emotion-semantic structure from 5-shot source labels and applies losses to unlabeled target samples. Claims of effectiveness rest on experimental results across languages rather than a closed mathematical derivation or prediction that reduces to its inputs by construction. No equations or sections are shown that equate the self-organization outcome to a fitted parameter or self-citation chain; the framework is presented as a design choice validated externally via benchmarks. This is the common non-circular case for applied ML papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

Abstract-only view reveals heavy reliance on newly introduced constructs whose mathematical definitions and external grounding are not supplied.

invented entities (3)

Semantic-Emotional Resonance Embedding (SERE) · no independent evidence
purpose: Cross-lingual dynamic feature paradigm for semi-supervised emotion recognition
Core proposed framework that constructs emotion-semantic structure from limited labels.

Instantaneous Resonance Field (IRF) · no independent evidence
purpose: Mechanism allowing unlabeled samples to self-organize into the emotion-semantic structure
New field concept enabling structural discovery without supervision.

Triple-Resonance Interaction Chain (TRIC) loss · no independent evidence
purpose: Loss that reinforces interaction between labeled and unlabeled samples at emotional highlights
Custom training objective introduced to strengthen embedding capabilities.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages

[1] S. Latif, R. Rana, S. Khalifa et al., “Self supervised adversarial domain adaptation for cross-corpus and cross-language speech emotion recognition,” IEEE Transactions on Affective Computing, vol. 14, pp. 1912–1926, 2022.

[2] S. Bekkali, G. J. Youssef, P. H. Donaldson et al., “Is the putative mirror neuron system associated with empathy? a systematic review and meta-analysis,” Neuropsychology review, vol. 31, pp. 14–57, 2021.

[3] D. Wang, B. Jing, C. Lu et al., “Coarse alignment of topic and sentiment: A unified model for cross-lingual sentiment classification,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 736–747, 2020.

[4] R. Zhao, X. Jiang, F. R. Yu et al., “Leveraging cross-attention transformer and multi-feature fusion for cross-linguistic speech emotion recognition,” IEEE Internet of Things Journal, 2025.

[5] Y. Yu, Z. Jia et al., “Weavenet: End-to-end audiovisual sentiment analysis,” in International Conference on Cognitive Systems and Signal Processing, 2021, pp. 3–16.

[6] D. Berthelot, N. Carlini, I. Goodfellow et al., “Mixmatch: A holistic approach to semi-supervised learning,” Advances in neural information processing systems, vol. 32, 2019.

[7] D.-H. Lee et al., “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in Workshop on challenges in representation learning, ICML, vol. 3, no. 2. Atlanta, 2013, p. 896.

[8] H. Xu, L. Liu, Q. Bian et al., “Semi-supervised semantic segmentation with prototype-based consistency regularization,” Advances in neural information processing systems, vol. 35, pp. 26 007–26 020, 2022.

[9] W. Feng, J. Zhang, Y. Dong et al., “Graph random neural networks for semi-supervised learning on graphs,” Advances in neural information processing systems, vol. 33, pp. 22 092–22 103, 2020.

[10] J. Zhang, L. Jiang, Y. Zong et al., “Cross-corpus speech emotion recognition using joint distribution adaptive regression,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 3790–3794.

[11] C. Lu, Y. Zong, C. Tang et al., “Implicitly aligning joint distributions for cross-corpus speech emotion recognition,” Electronics, vol. 11, p. 2745, 2022.

[12] X. Zhou, J. Li, Q. Yu et al., “Classification inconsistency alignment network for cross-corpus speech emotion recognition,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.

[13] S. Li, C. Lu, Y. Zhao et al., “Low-rank joint distribution adaptation for cross-corpus speech emotion recognition,” Knowledge-Based Systems, vol. 315, p. 113260, 2025.

[14] Y. Gao, L. Wang, J. Liu et al., “Adversarial domain generalized transformer for cross-corpus speech emotion recognition,” IEEE Transactions on Affective Computing, vol. 15, pp. 697–708, 2023.

[15] J.-B. Grill, F. Strub, F. Altché et al., “Bootstrap your own latent-a new approach to self-supervised learning,” Advances in neural information processing systems, vol. 33, pp. 21 271–21 284, 2020.

[16] X. Chen and K. He, “Exploring simple siamese representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15 750–15 758.

[17] R. Zhang, J. Wei, X. Lu et al., “An adaptation framework with unified embedding reconstruction for cross-corpus speech emotion recognition,” Applied Soft Computing, vol. 174, p. 112948, 2025.

[18] M. Long, Y. Cao, J. Wang et al., “Learning transferable features with deep adaptation networks,” in International conference on machine learning. PMLR, 2015, pp. 97–105.

[19] S. Yang, J. Van de Weijer, L. Herranz et al., “Exploiting the intrinsic neighborhood structure for source-free domain adaptation,” Advances in neural information processing systems, pp. 29 393–29 405, 2021.

[20] S. Yang, S. Jui, J. Van De Weijer et al., “Attracting and dispersing: A simple approach for source-free domain adaptation,” Advances in Neural Information Processing Systems, vol. 35, pp. 5802–5815, 2022.

[21] Y. Zhao, J. Wang, C. Lu et al., “Emotion-aware contrastive adaptation network for source-free cross-corpus speech emotion recognition,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11 846–11 850.

[22] F. Burkhardt, A. Paeschke, M. Rolfes et al., “A database of german emotional speech.” in Interspeech, vol. 5, 2005, pp. 1517–1520.

[23] O. Martin, I. Kotsia, B. Macq et al., “The enterface’05 audio-visual emotion database,” in 22nd international conference on data engineering workshops (ICDEW’06). IEEE, 2006, pp. 8–8.

[24] J. Zhang and H. Jia, “Design of speech corpus for mandarin text to speech,” in The blizzard challenge 2008 workshop, 2008.

[25] G. Costantini, I. Iaderola, A. Paoloni et al., “Emovo corpus: an italian emotional speech database,” in Proceedings of the ninth international conference on language resources and evaluation (LREC’14). European Language Resources Association (ELRA), 2014, pp. 3501–3504.

[26] A. Baevski, Y. Zhou, A. Mohamed et al., “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in neural information processing systems, vol. 33, 2020, pp. 12 449–12 460.

[27] J. W. Kim, J. Salamon, P. Li et al., “Crepe: A convolutional representation for pitch estimation,” in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2018, pp. 161–165.

[28] B. McFee, C. Raffel, D. Liang et al., “librosa: Audio and music signal analysis in python.” SciPy, vol. 2015, pp. 18–24, 2015.
