pith. machine review for the scientific record.

arxiv: 2604.17134 · v1 · submitted 2026-04-18 · 💻 cs.CL

Recognition: unknown

RoIt-XMASA: Multi-Domain Multilingual Sentiment Analysis Dataset for Romanian and Italian

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:24 UTC · model grok-4.3

classification 💻 cs.CL
keywords sentiment analysis · multilingual dataset · adversarial training · cross-lingual · cross-domain · Romanian · Italian · XLM-R

The pith

The RoIt-XMASA dataset, combined with meta-learned adversarial training, lets XLM-R reach 66.23% F1 in cross-lingual and cross-domain sentiment analysis for Italian and Romanian.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RoIt-XMASA, a dataset that extends an existing cross-lingual multi-domain Amazon review corpus to Italian and Romanian, with 36,000 labeled samples across books, movies, and music plus 202,141 unlabeled ones. It proposes a multi-target adversarial training framework that reverses losses on domain and language prediction while using meta-learned coefficients to keep sentiment learning strong. This yields an F1-score of 66.23% for XLM-R, 4.64 points above the baseline. Few-shot tests with Llama-3.1-8B reach only 58.43%, showing a clear gap between specialized fine-tuning and prompting. A reader would care because the work supplies data and a balancing method for sentiment tools that work across languages and topics without separate models for each.
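The few-shot comparison above does not specify the prompt format. A minimal sketch of how such a prompt could be assembled for an instruction-following model like Llama-3.1-8B; the example reviews, labels, and function name are illustrative only, not taken from the paper:

```python
# Illustrative few-shot prompt construction for review sentiment classification.
# The example reviews and labels are made up for demonstration; the paper's
# actual prompt template and label set may differ.
FEW_SHOT_EXAMPLES = [
    ("Questo libro è stato una delusione totale.", "negative"),          # Italian
    ("Un film emozionante, lo consiglio a tutti.", "positive"),          # Italian
    ("Cartea este excelentă, am citit-o pe nerăsuflate.", "positive"),   # Romanian
]

def build_prompt(review_text: str) -> str:
    """Assemble a few-shot prompt to be sent to a causal language model."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {review_text}")
    lines.append("Sentiment:")
    return "\n".join(lines)
```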

Core claim

We present the RoIt-XMASA dataset extending cross-lingual multi-domain sentiment analysis to Italian and Romanian with 36,000 labeled and 202,141 unlabeled reviews in books, movies, and music. Our multi-target adversarial training framework uses loss reversal with meta-learned coefficients to balance sentiment classification against domain and language invariance. This allows XLM-R to achieve an F1-score of 66.23%, which is 4.64% better than the baseline, while few-shot prompting of Llama-3.1-8B yields 58.43% F1-score.

What carries the argument

Multi-target adversarial training framework with loss reversal and meta-learned coefficients for balancing sentiment discrimination, domain invariance, and language invariance.
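A minimal sketch of what such a framework could look like in PyTorch, assuming a gradient-reversal implementation of the loss reversal and directly learnable stand-ins for the meta-learned coefficients (a faithful version would update them with an outer, meta-level objective). Module and attribute names here are hypothetical, not the authors' code:

```python
# Sketch of multi-target adversarial training: one shared encoder representation
# feeds a sentiment head plus gradient-reversed domain and language heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) the gradient on the way back."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class MultiTargetAdversarialHead(nn.Module):
    def __init__(self, hidden_size: int, n_sent: int = 2, n_domains: int = 3, n_langs: int = 2):
        super().__init__()
        self.sentiment = nn.Linear(hidden_size, n_sent)
        self.domain = nn.Linear(hidden_size, n_domains)
        self.language = nn.Linear(hidden_size, n_langs)
        # Stand-ins for the meta-learned balancing coefficients; here they are
        # plain learnable parameters, not the paper's meta-learning procedure.
        self.log_coeff = nn.Parameter(torch.zeros(2))

    def forward(self, pooled, sent_y, dom_y, lang_y):
        w_dom, w_lang = self.log_coeff.exp()
        loss_sent = F.cross_entropy(self.sentiment(pooled), sent_y)
        # The adversarial branches see gradient-reversed features, pushing the
        # encoder toward domain- and language-invariant representations.
        loss_dom = F.cross_entropy(self.domain(GradReverse.apply(pooled, 1.0)), dom_y)
        loss_lang = F.cross_entropy(self.language(GradReverse.apply(pooled, 1.0)), lang_y)
        return loss_sent + w_dom * loss_dom + w_lang * loss_lang
```

In use, `pooled` would be XLM-R's pooled sequence representation, and the combined loss would be backpropagated through the encoder so that sentiment discrimination and domain/language invariance are trained jointly.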

If this is right

  • Higher accuracy on cross-lingual and cross-domain sentiment classification for Italian and Romanian.
  • Meta-learning provides a workable way to tune multiple adversarial objectives at once.
  • Fine-tuned models outperform few-shot large language models on this task by a clear margin.
  • New labeled and unlabeled resources exist for testing future multilingual models in these languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dataset construction and balancing method could extend to other language pairs or additional domains.
  • The approach may apply to other tasks where models must ignore certain attributes like topic while focusing on a target label.
  • Larger models or refined prompting could narrow the gap to fine-tuned performance in future experiments.
  • Uniform models trained this way could simplify real-world analysis of customer feedback across languages and product types.

Load-bearing premise

The meta-learned coefficients successfully balance sentiment discrimination against domain and language invariance without causing training instability or overfitting on the specific dataset splits.

What would settle it

Retraining XLM-R on the same splits with fixed rather than meta-learned coefficients and finding no statistically significant F1 improvement over the baseline would show the meta-learning step adds no value.
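A minimal version of that check, assuming per-seed test F1 scores are available for both the meta-learned and the fixed-coefficient variants trained on identical splits; the paired t-test is one reasonable choice of significance test, not necessarily the one the authors would use:

```python
# Decide whether meta-learned coefficients add a statistically detectable F1
# gain over fixed coefficients, given matched per-seed scores for each variant.
from statistics import mean, stdev
from scipy.stats import ttest_rel

def meta_learning_adds_value(f1_meta: list[float], f1_fixed: list[float],
                             alpha: float = 0.05) -> bool:
    """Paired t-test over seeds; True only if the meta-learned variant is significantly higher."""
    t_stat, p_value = ttest_rel(f1_meta, f1_fixed)
    print(f"meta-learned: {mean(f1_meta):.2f} +/- {stdev(f1_meta):.2f}")
    print(f"fixed:        {mean(f1_fixed):.2f} +/- {stdev(f1_fixed):.2f}")
    print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
    return p_value < alpha and mean(f1_meta) > mean(f1_fixed)
```

If this returns False on the same splits, the F1 advantage cannot be attributed to the meta-learning step.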

Figures

Figures reproduced from arXiv: 2604.17134 by Andrei-Marius Avram, Aureliu Valentin Antonie, Cosmin-Mircea Croitoru, Dumitru-Clementin Cercel, Vlad Andrei Muntean.

Figure 1: Rating distribution across the RoIt-XMASA dataset. Left: overall rating distribution; center: rating distribution by …
Figure 2: Token distribution histograms for the text and title fields.
Figure 3: Token distribution of the text in RoIt-XMASA, grouped by language.
Figure 4: Token distribution of the title in RoIt-XMASA, grouped by language.
Figure 5: Token distribution of the text in RoIt-XMASA, grouped by domain.
Figure 6: Token distribution of the title in RoIt-XMASA, grouped by domain.
Original abstract

We present RoIt-XMASA, a multilingual dataset that extends the Cross-lingual Multi-domain Amazon Sentiment Analysis to Italian and Romanian, comprising 36,000 labeled reviews across three domains (books, movies, and music) and 202,141 unlabeled samples. To address cross-lingual and cross-domain challenges, we propose a multi-target adversarial training framework that employs loss reversal with meta-learned coefficients to dynamically balance sentiment discrimination with domain and language invariance. XLM-R achieves an F1-score of 66.23% with our approach, outperforming the baseline by 4.64%. Few-shot evaluation shows that Llama-3.1-8B achieves 58.43% F1-score, revealing a meaningful trade-off between the efficiency of prompting-based approaches and the higher performance of task-specific fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RoIt-XMASA, a multilingual multi-domain sentiment analysis dataset extending prior Amazon review resources to Romanian and Italian, with 36,000 labeled reviews across books/movies/music domains plus 202,141 unlabeled samples. It proposes a multi-target adversarial training method for XLM-R that applies loss reversal with meta-learned coefficients to dynamically balance sentiment discrimination against domain and language invariance, reporting 66.23% F1 (4.64% above baseline) and a few-shot Llama-3.1-8B result of 58.43% F1.

Significance. If validated, the dataset provides a useful new resource for cross-lingual and cross-domain sentiment analysis in lower-resource Romance languages, and the adversarial framework with meta-learned balancing offers a potentially generalizable technique for multi-objective invariance. The efficiency comparison to prompting-based LLMs is also informative. The work earns credit for dataset scale and the inclusion of both fine-tuning and few-shot evaluations.

major comments (2)
  1. [Method and Results sections (as described in abstract)] The central performance claim (XLM-R at 66.23% F1 via the proposed multi-target adversarial training, 4.64% above baseline) depends on the meta-learned coefficients successfully balancing the objectives without overfitting to the specific 36k labeled splits. The method description indicates dynamic balancing via loss reversal, but no coefficient trajectories, training dynamics, seed-wise variance, or ablation isolating the meta-learning component from other training choices are provided; this leaves the attribution of the gain unverified and is load-bearing for the main result.
  2. [Experimental evaluation (abstract and results)] No error bars, statistical significance tests, or full experimental protocol (including exact train/val/test partitions, hyperparameter search, and convergence criteria) accompany the reported F1 scores or the few-shot evaluation; this undermines assessment of whether the 4.64% margin is reproducible or statistically meaningful.
minor comments (2)
  1. [Dataset description] Dataset construction details (e.g., annotation process, domain balance, and how the 202k unlabeled samples are used) would benefit from a dedicated table or subsection for clarity.
  2. [Method] Notation for the meta-learned coefficients and the precise form of the multi-target loss could be formalized with equations to improve reproducibility.
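One plausible formalization of the kind the referee asks for, written here in illustrative notation rather than the paper's own: the sentiment loss minus weighted adversarial losses, with coefficients chosen on held-out data.

```latex
% Illustrative notation only; the paper's exact formulation may differ.
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{sent}}(\theta)
  - \lambda_{\mathrm{dom}}\,\mathcal{L}_{\mathrm{dom}}(\theta)
  - \lambda_{\mathrm{lang}}\,\mathcal{L}_{\mathrm{lang}}(\theta),
\qquad
(\lambda_{\mathrm{dom}}, \lambda_{\mathrm{lang}})
  = \arg\min_{\lambda}\ \mathcal{L}_{\mathrm{sent}}^{\mathrm{val}}\!\bigl(\theta^{*}(\lambda)\bigr)
```

Here the subtracted terms encode the loss reversal on the domain and language heads, and the outer minimization over held-out sentiment loss is one standard way to cast meta-learned coefficients as a bi-level problem.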

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments below regarding verification of the performance gains and experimental reproducibility. We will incorporate additional analyses and details in the revised version to strengthen the paper.

Point-by-point responses
  1. Referee: [Method and Results sections (as described in abstract)] The central performance claim (XLM-R at 66.23% F1 via the proposed multi-target adversarial training, 4.64% above baseline) depends on the meta-learned coefficients successfully balancing the objectives without overfitting to the specific 36k labeled splits. The method description indicates dynamic balancing via loss reversal, but no coefficient trajectories, training dynamics, seed-wise variance, or ablation isolating the meta-learning component from other training choices are provided; this leaves the attribution of the gain unverified and is load-bearing for the main result.

    Authors: We agree that further details on the meta-learning dynamics are needed to fully attribute the gains. In the revised manuscript, we will add plots showing the trajectories of the meta-learned coefficients across training epochs, report performance with standard deviations over multiple random seeds, and include an ablation study comparing the full meta-learning approach against variants with fixed coefficients or without the meta-component. These additions will help confirm the balancing mechanism's effectiveness. revision: yes

  2. Referee: [Experimental evaluation (abstract and results)] No error bars, statistical significance tests, or full experimental protocol (including exact train/val/test partitions, hyperparameter search, and convergence criteria) accompany the reported F1 scores or the few-shot evaluation; this undermines assessment of whether the 4.64% margin is reproducible or statistically meaningful.

    Authors: We acknowledge the importance of these elements for reproducibility. In the revision, we will report all F1 scores with error bars (standard deviation across 5 seeds), include statistical significance tests (e.g., paired t-tests with p-values) for the reported improvements, and expand the Experimental Setup section with exact train/val/test splits, the hyperparameter search procedure, and convergence criteria. Similar details will be added for the few-shot Llama-3.1-8B evaluation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical F1 scores measured on held-out test splits of the newly introduced dataset

full rationale

The central result is an F1-score of 66.23% achieved by XLM-R on the RoIt-XMASA test set after multi-target adversarial training. This metric is computed on data partitions that are disjoint from training, so the reported improvement over baseline does not reduce by construction to any fitted coefficient or self-citation. The meta-learned coefficients are internal to the training procedure; their effect is evaluated externally via held-out performance rather than being redefined as the result itself. No self-definitional equations, renamed known results, or load-bearing self-citations that collapse the claim are present in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard supervised learning assumptions plus the effectiveness of the adversarial invariance mechanism; no new physical entities or ungrounded constants are introduced.

axioms (2)
  • domain assumption Labeled reviews in the dataset carry accurate sentiment annotations.
    Required for any supervised sentiment training claim; invoked implicitly when reporting F1 scores.
  • domain assumption Adversarial loss reversal combined with meta-learned coefficients can produce representations that are invariant to domain and language while remaining discriminative for sentiment.
    Core premise of the proposed multi-target framework stated in the abstract.

pith-pipeline@v0.9.0 · 5465 in / 1344 out tokens · 39232 ms · 2026-05-10T06:24:26.488092+00:00 · methodology

