pith. machine review for the scientific record.

arxiv: 2604.12633 · v1 · submitted 2026-04-14 · 💻 cs.CL

Multilingual Multi-Label Emotion Classification at Scale with Synthetic Data

Pith reviewed 2026-05-10 15:36 UTC · model grok-4.3

classification 💻 cs.CL
keywords multilingual emotion classification · synthetic data · multi-label classification · zero-shot evaluation · transformer encoders · cross-lingual transfer · emotion detection · culturally adapted generation

The pith

A multilingual emotion classifier trained on synthetic data across 23 languages matches English-only models on human benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Emotion classification has long been limited by the scarcity of annotated data, which is mostly English, single-label, and confined to a handful of languages. This paper generates a synthetic corpus of over one million multi-label emotion examples in 23 languages through culturally adapted generation and quality filtering. Models trained on this data, especially XLM-R-Large, reach strong results on the synthetic test set and match or exceed English specialist models in zero-shot evaluation on human-annotated benchmarks such as GoEmotions and SemEval-2018. The result matters because it shows a path to emotion classifiers that work across many languages without new human labels for each one.

Core claim

The authors build a synthetic training set of more than 1M multi-label samples covering 11 emotions in 23 languages and demonstrate that transformer models trained on it deliver competitive zero-shot performance on real human-annotated datasets while natively supporting all 23 languages; XLM-R-Large ties English-only specialists on AP-micro and LRAP and surpasses them on AUC-micro.
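
The three ranking metrics in that claim are standard and easy to reproduce. A minimal sketch on toy data (the arrays below are illustrative, not drawn from the paper), using scikit-learn's implementations:

```python
import numpy as np
from sklearn.metrics import (average_precision_score,
                             label_ranking_average_precision_score,
                             roc_auc_score)

# Toy multi-label ground truth (3 samples, 4 emotion classes) and model scores.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.8, 0.1],
                    [0.1, 0.7, 0.3, 0.2],
                    [0.8, 0.6, 0.4, 0.9]])

# Micro-averaging flattens the (sample, label) grid, so every decision counts once.
ap_micro = average_precision_score(y_true, y_score, average="micro")
auc_micro = roc_auc_score(y_true, y_score, average="micro")
# LRAP averages, per sample, the precision at the rank of each true label.
lrap = label_ranking_average_precision_score(y_true, y_score)
```

Because these metrics rank raw scores rather than apply a cutoff, they sidestep the threshold-calibration problem that makes cross-dataset F1 comparisons fragile, which is why the paper leans on them for the zero-shot comparison.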

What carries the argument

The large-scale synthetic corpus created by culturally-adapted generation followed by programmatic quality filtering.

If this is right

  • Emotion classifiers become feasible for many languages without language-specific human annotation.
  • A single model can handle multi-label emotion detection consistently across 23 languages.
  • Zero-shot transfer from synthetic data to human test sets is viable when generation includes cultural adaptation.
  • The same data-construction method supports scaling to additional languages or emotion categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could be tested on other scarce-annotation tasks such as toxicity or intent classification in multiple languages.
  • Increasing language coverage further while keeping the same generation pipeline might reveal where cultural adaptation begins to fail.
  • Public release of the best base-sized model allows direct use and further fine-tuning on new languages.

Load-bearing premise

The synthetic examples carry emotional labels and cultural fidelity close enough to real human text that models trained on them generalize to human-annotated benchmarks.

What would settle it

Testing the released model on a fresh human-annotated multi-label emotion dataset in one of the covered languages and observing performance well below English specialist baselines would show the synthetic data did not generalize.
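
Operationally, that check reduces to scoring the released model's predictions against fresh gold labels. A minimal sketch of the comparison step, with toy arrays standing in for model output and human annotations:

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over multi-hot label matrices of shape (n, n_labels)."""
    tp = int(np.logical_and(y_true == 1, y_pred == 1).sum())
    fp = int(np.logical_and(y_true == 0, y_pred == 1).sum())
    fn = int(np.logical_and(y_true == 1, y_pred == 0).sum())
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy gold labels vs. toy thresholded predictions (2 samples, 3 emotions).
gold = np.array([[1, 0, 1], [0, 1, 0]])
pred = np.array([[1, 0, 0], [0, 1, 1]])
score = micro_f1(gold, pred)  # 2 TP, 1 FP, 1 FN -> F1 = 2/3
```

Micro-averaging counts each (sample, label) decision once, so frequent emotions dominate the score; a macro-averaged variant would weight rare emotions equally and could fail on a fresh benchmark even where the micro score holds up.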

Figures

Figures reproduced from arXiv: 2604.12633 by Vadim Borisov.

Figure 1. In-domain test performance across four key metrics.
Figure 2. Per-language F1-micro on the test set. Languages are sorted by mean F1 across models.
Figure 3. Compute vs. quality Pareto frontier. Training time (min, log scale) vs. test Jaccard.
Figure 4. Micro-averaged precision-recall curves on the in-domain test set.
Original abstract

Emotion classification in multilingual settings remains constrained by the scarcity of annotated data: existing corpora are predominantly English, single-label, and cover few languages. We address this gap by constructing a large-scale synthetic training corpus of over 1M multi-label samples (50k per language) across 23 languages: Arabic, Bengali, Dutch, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Mandarin, Polish, Portuguese, Punjabi, Russian, Spanish, Swahili, Tamil, Turkish, Ukrainian, Urdu, and Vietnamese, covering 11 emotion categories using culturally-adapted generation and programmatic quality filtering. We train and compare six multilingual transformer encoders, from DistilBERT (135M parameters) to XLM-R-Large (560M parameters), under identical conditions. On our in-domain test set, XLM-R-Large achieves 0.868 F1-micro and 0.987 AUC-micro. To validate against human-annotated data, we evaluate all models zero-shot on GoEmotions (English) and SemEval-2018 Task 1 E-c (English, Arabic, Spanish). On threshold-free ranking metrics, XLM-R-Large matches or exceeds English-only specialist models, tying on AP-micro (0.636) and LRAP (0.804) while surpassing on AUC-micro (0.810 vs. 0.787), while natively supporting all 23 languages. The best base-sized model is publicly available at https://huggingface.co/tabularisai/multilingual-emotion-classification

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper constructs a synthetic multi-label emotion dataset of over 1M examples (50k per language) across 23 languages and 11 emotion categories via culturally-adapted generation and programmatic filtering. Six multilingual encoders (DistilBERT to XLM-R-Large) are trained under identical conditions. In-domain results reach 0.868 F1-micro and 0.987 AUC-micro for XLM-R-Large. Zero-shot transfer to GoEmotions and SemEval-2018 Task 1 E-c yields competitive or superior threshold-free ranking metrics (AP-micro 0.636, LRAP 0.804, AUC-micro 0.810) compared to English-only specialists, with native support for all 23 languages. The best model is released publicly.

Significance. If the synthetic data's labels and cultural fidelity transfer as claimed, the work offers a scalable route to multilingual emotion classification without large-scale human annotation, directly addressing data scarcity for 22 non-English languages. The in-domain performance is strong, the public model release is a clear asset, and the multi-label, threshold-free evaluation provides useful benchmarks. The approach could influence data-generation pipelines in low-resource NLP more broadly.

major comments (2)
  1. [Zero-shot evaluation on GoEmotions and SemEval-2018] The zero-shot generalization claim (XLM-R-Large matching or exceeding English specialists on GoEmotions/SemEval-2018 ranking metrics) is load-bearing and rests on the unverified assumption that synthetic labels match human distributions. No human re-annotation of synthetic samples, inter-annotator agreement statistics, or error analysis on synthetic-vs-real label mismatches is reported.
  2. [Synthetic data generation and filtering] Filtering thresholds are listed as free parameters yet no justification, sensitivity analysis, or ablation isolating their effect on label co-occurrence distributions is provided. This directly affects confidence in the cross-dataset transfer results.
minor comments (2)
  1. [Abstract] The abstract states 'over 1M' samples but the per-language figure (50k × 23) implies ~1.15M; a precise total would improve clarity.
  2. [Introduction] The choice of the 11 emotion categories and their mapping to the target benchmarks could be cross-referenced more explicitly to prior taxonomies.
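
The taxonomy cross-referencing the comment asks for amounts to a many-to-one collapse of label sets. A minimal sketch with a hypothetical mapping fragment (the paper's Appendix A defines the actual 28-to-11 GoEmotions table; the targets below are illustrative):

```python
# Hypothetical fragment of a GoEmotions -> 11-category mapping. The source
# keys are real GoEmotions labels; the coarse targets are assumptions here.
MAPPING = {
    "admiration": "joy",
    "amusement": "joy",
    "annoyance": "anger",
    "disappointment": "sadness",
    "grief": "sadness",
}

def collapse_labels(fine_labels, mapping):
    """Collapse fine-grained labels into the coarser taxonomy (many-to-one)."""
    return sorted({mapping[label] for label in fine_labels if label in mapping})

coarse = collapse_labels({"amusement", "grief", "admiration"}, MAPPING)
```

Making the mapping explicit would also let readers verify that multi-hot vectors are merged with a logical OR, rather than dropped, when several fine labels share one coarse target.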

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below with honest responses and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: The zero-shot generalization claim (XLM-R-Large matching or exceeding English specialists on GoEmotions/SemEval-2018 ranking metrics) is load-bearing and rests on the unverified assumption that synthetic labels match human distributions. No human re-annotation of synthetic samples, inter-annotator agreement statistics, or error analysis on synthetic-vs-real label mismatches is reported.

    Authors: We agree that direct human validation of the synthetic labels would provide stronger corroboration. The manuscript relies on zero-shot transfer performance as indirect evidence of label utility, since models trained solely on the synthetic data achieve competitive or superior ranking metrics on human-annotated benchmarks without any target-domain fine-tuning. We will revise the paper to include: (i) expanded discussion of the culturally-adapted generation and programmatic filtering process, (ii) distributional comparison of emotion co-occurrences between the synthetic corpus and the GoEmotions/SemEval-2018 test sets, and (iii) an explicit limitations paragraph acknowledging the absence of human re-annotation and inter-annotator agreement on the synthetic data. Full-scale human re-annotation of >1M examples is not feasible within current resources, but the added analysis will better contextualize the claims. revision: partial

  2. Referee: Filtering thresholds are listed as free parameters yet no justification, sensitivity analysis, or ablation isolating their effect on label co-occurrence distributions is provided. This directly affects confidence in the cross-dataset transfer results.

    Authors: We acknowledge that the filtering thresholds were presented without sufficient supporting analysis. These thresholds were selected via preliminary tuning to retain label diversity while removing low-quality generations, but no sensitivity or ablation results were included. In the revised manuscript we will add a new subsection (or appendix) reporting sensitivity analysis across threshold values, quantifying their impact on label co-occurrence statistics, label cardinality, and downstream zero-shot performance on both GoEmotions and SemEval-2018. This will directly address the concern about robustness of the cross-dataset transfer results. revision: yes
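
The promised sensitivity analysis can be structured as a simple sweep: for each candidate threshold, report how much of the corpus survives and how the label statistics shift. A minimal sketch on stand-in data (the quality scores and labels are random placeholders, not the paper's corpus):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in corpus: one quality score and one 11-way multi-hot row per sample.
scores = rng.uniform(size=1000)
labels = rng.integers(0, 2, size=(1000, 11))

for tau in (0.3, 0.5, 0.7):
    keep = scores >= tau
    retained = keep.mean()                         # fraction of corpus kept
    cardinality = labels[keep].sum(axis=1).mean()  # mean labels per kept sample
    print(f"tau={tau:.1f}  retained={retained:.2f}  "
          f"mean label cardinality={cardinality:.2f}")
```

On real data the interesting output is not retention itself but whether label co-occurrence structure drifts as the threshold rises, which is exactly the ablation the referee asked for.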

Circularity Check

0 steps flagged

No circularity: empirical pipeline grounded in external human-annotated benchmarks

Full rationale

The paper constructs a synthetic training corpus via culturally-adapted generation and programmatic filtering, trains multilingual encoders, and reports in-domain metrics plus zero-shot transfer to independently collected human-annotated datasets (GoEmotions and SemEval-2018 Task 1). No equations, fitted parameters, or self-citations reduce any reported prediction or ranking metric to the synthetic inputs by construction; the external benchmarks supply independent grounding. The central claims therefore remain falsifiable against real distributions and do not collapse into self-definition or renaming.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The work relies on the assumption that LLM-generated text with programmatic filters can stand in for human annotations; no new physical or mathematical entities are introduced.

free parameters (2)
  • filtering thresholds
    Programmatic quality filters use unspecified cutoffs that determine which generated samples are kept.
  • emotion category definitions
    The 11 emotion categories and their cultural adaptations are chosen by the authors.
axioms (2)
  • domain assumption LLM-generated text can be made culturally appropriate via prompt engineering
    Invoked in the description of culturally-adapted generation.
  • domain assumption Programmatic filters remove low-quality or mislabeled examples without introducing systematic bias
    Central to the data construction pipeline.

pith-pipeline@v0.9.0 · 5574 in / 1417 out tokens · 48092 ms · 2026-05-10T15:36:00.947278+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 6 canonical work pages · 2 internal anchors

  1. [1]

    Text-based emotion classification using emotion cause extraction

    Weiyuan Li and Hua Xu. Text-based emotion classification using emotion cause extraction. Expert Systems with Applications, 41(4):1742–1749, 2014

  2. [2]

    Emotion analysis in NLP: Trends, gaps and roadmap for future directions

    Flor Miriam Plaza-del Arco, Alba A Cercas Curry, Amanda Cercas Curry, and Dirk Hovy. Emotion analysis in NLP: Trends, gaps and roadmap for future directions. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 5696–5710, 2024

  3. [3]

    GoEmotions: A dataset of fine-grained emotions

    Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. GoEmotions: A dataset of fine-grained emotions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4040–4054. Association for Computational Linguistics, 2020

  4. [4]

    SemEval-2018 Task 1: Affect in tweets

    Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. SemEval-2018 Task 1: Affect in tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 1–17. Association for Computational Linguistics, 2018

  5. [5]

    Emotion semantics show both cultural variation and universal structure

    Joshua Conrad Jackson, Joseph Watts, Teague R Henry, Johann-Mattis List, Robert Forkel, Peter J Mucha, Simon J Greenhill, Russell D Gray, and Kristen A Lindquist. Emotion semantics show both cultural variation and universal structure. Science, 366(6472):1517–1522, 2019

  6. [6]

    An argument for basic emotions

    Paul Ekman. An argument for basic emotions. Cognition & Emotion, 6(3–4):169–200, 1992

  7. [7]

    Modeling romanized Hindi and Bengali: Dataset creation and multilingual LLM integration

    Kanchon Gharami, Quazi Sarwar Muhtaseem, Deepti Gupta, Lavanya Elluri, and Shafika Showkat Moni. Modeling romanized Hindi and Bengali: Dataset creation and multilingual LLM integration. arXiv preprint arXiv:2511.22769, 2025

  8. [8]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019

  9. [9]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186. Association for Computational Linguistics, 2019

  10. [10]

    Unsupervised cross-lingual representation learning at scale

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451. Association for ...

  11. [11]

    XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond

    Francesco Barbieri, Luis Espinosa Anke, and Jose Camacho-Collados. XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 258–266, Marseille, France, June 2022. European Language Resources Association

  12. [12]

    DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing

    Pengcheng He, Jianfeng Gao, and Weizhu Chen. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing, 2021

  13. [13]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  14. [14]

    Do chatbot LLMs talk too much? The YapBench benchmark

    Vadim Borisov, Michael Gröger, Mina Mikhael, and Richard H Schreiber. Do chatbot LLMs talk too much? The YapBench benchmark. arXiv preprint arXiv:2601.00624, 2026

  15. [15]

    Open artificial knowledge

    Vadim Borisov and Richard H Schreiber. Open artificial knowledge. arXiv preprint arXiv:2407.14371, 2024

  16. [16]

    Self-Instruct: Aligning language models with self-generated instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 13484–13508. Association for Computational Linguistics, 2023

  17. [17]

    AugGPT: Leveraging ChatGPT for text data augmentation

    Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Yihan Cao, Zhi Wu, Lin Zhao, Shaochen Xu, Wei Liu, Ninghao Liu, et al. AugGPT: Leveraging ChatGPT for text data augmentation. arXiv preprint arXiv:2302.13007, 2023

  18. [18]

    Synthetic data generation pipeline for low-resource Swahili sentiment analysis: Multi-LLM judging with human validation

    Samuel Gyamfi, Alfred Malengo Kondoro, Yankı Öztürk, Richard Hans Schreiber, and Vadim Borisov. Synthetic data generation pipeline for low-resource Swahili sentiment analysis: Multi-LLM judging with human validation. In Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026), pages 116–141, 2026

  19. [19]

    DeBERTa: Decoding-enhanced BERT with disentangled attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. In Proceedings of the International Conference on Learning Representations, 2021

  20. [20]

    XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation

    Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International Conference on Machine Learning, pages 4411–4421. PMLR, 2020