Zero-Shot to Full-Resource: Cross-lingual Transfer Strategies for Aspect-Based Sentiment Analysis
Pith reviewed 2026-05-07 11:08 UTC · model grok-4.3
The pith
Fine-tuned large language models achieve the best results in multilingual aspect-based sentiment analysis when using cross-lingual training on multiple languages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fine-tuned Large Language Models achieve the highest overall scores, particularly in complex generative tasks, while few-shot counterparts approach this performance in simpler setups, where smaller encoder models also remain competitive. Cross-lingual training on multiple non-target languages yields the strongest transfer for fine-tuned LLMs, while smaller encoder or seq-to-seq models benefit most from code-switching.
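As a rough illustration of what "cross-lingual training on multiple non-target languages" means in practice, the sketch below pools training examples from every language except the target. The language codes mirror the seven languages named in the abstract; the function name `build_transfer_pool` and the toy data layout are assumptions made for illustration, not the authors' pipeline.

```python
from typing import Dict, List

# ISO-style codes for the seven languages listed in the abstract.
LANGUAGES = ["en", "de", "fr", "nl", "ru", "es", "cs"]


def build_transfer_pool(data_by_lang: Dict[str, List[str]], target: str) -> List[str]:
    """Concatenate training examples from all languages except the target."""
    if target not in data_by_lang:
        raise KeyError(f"unknown target language: {target}")
    pool: List[str] = []
    for lang in sorted(data_by_lang):
        if lang != target:
            pool.extend(data_by_lang[lang])
    return pool


# Toy data: two placeholder examples per language (hypothetical).
toy = {lang: [f"{lang}-example-{i}" for i in range(2)] for lang in LANGUAGES}
pool_for_german = build_transfer_pool(toy, target="de")
```

With German as the target, the pool contains examples from the six remaining languages only, so the model never sees target-language annotations during training.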
What carries the argument
Architecture-specific transfer strategies that combine cross-lingual training on multiple languages, code-switching, and machine translation, tested across model sizes from small encoders to large language models under varying resource levels.
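To make the code-switching idea concrete, here is a minimal sketch of one common dictionary-based variant: tokens of an English sentence are replaced by target-language translations with some probability. The review does not describe the exact procedure used in the paper, so the function `code_switch`, the `ratio` parameter, and the tiny English-German lexicon are all hypothetical.

```python
import random
from typing import Dict, List


def code_switch(tokens: List[str], lexicon: Dict[str, str], ratio: float, seed: int = 0) -> List[str]:
    """Replace tokens found in the lexicon with their translation with probability `ratio`."""
    rng = random.Random(seed)  # local generator keeps the augmentation reproducible
    switched: List[str] = []
    for token in tokens:
        translation = lexicon.get(token.lower())
        if translation is not None and rng.random() < ratio:
            switched.append(translation)
        else:
            switched.append(token)
    return switched


# Hypothetical bilingual lexicon (English -> German).
toy_lexicon = {"food": "Essen", "service": "Service", "great": "großartig", "slow": "langsam"}
sentence = "The food was great but the service was slow".split()

switched_all = code_switch(sentence, toy_lexicon, ratio=1.0)   # switch every known token
switched_none = code_switch(sentence, toy_lexicon, ratio=0.0)  # leave the sentence unchanged
```

The labels of the original English example would be carried over to the switched sentence, which is how such augmentation produces pseudo-training data without new annotation.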
If this is right
- Fine-tuned LLMs should be selected first for complex generative ABSA subtasks in multilingual settings.
- Training on multiple non-target languages maximizes transfer gains specifically for large models.
- Code-switching offers the most benefit to smaller encoder and seq-to-seq models in low-data regimes.
- The new German datasets enable direct comparison and further development of non-English ABSA systems.
Where Pith is reading between the lines
- The same architecture-dependent transfer rules may guide choices in other fine-grained multilingual NLP tasks such as opinion mining or entity linking.
- For very low-resource languages, the optimal path could be to first collect modest amounts of multi-language data rather than attempting full annotation in the target language alone.
- Task complexity should influence model-size selection more than raw data volume when designing cross-lingual ABSA pipelines.
Load-bearing premise
That the seven languages and four subtasks are representative of broader multilingual ABSA challenges and that the new German datasets carry no annotation bias that would distort model comparisons.
What would settle it
A reversal in performance rankings between fine-tuned LLMs and smaller models using code-switching when tested on a different set of languages or subtasks would indicate the architecture-specific strategies do not hold more generally.
Original abstract
Aspect-based Sentiment Analysis (ABSA) extracts fine-grained opinions toward specific aspects within text but remains largely English-focused despite major advances in transformer-based and instruction-tuned models. This work presents a multilingual evaluation of state-of-the-art ABSA approaches across seven languages (English, German, French, Dutch, Russian, Spanish, and Czech) and four subtasks (ACD, ACSA, TASD, ASQP). We systematically compare different transformer architectures under zero-resource, data-only, and full-resource settings, using cross-lingual transfer, code-switching and machine translation. Fine-tuned Large Language Models (LLMs) achieve the highest overall scores, particularly in complex generative tasks, while few-shot counterparts approach this performance in simpler setups, where smaller encoder models also remain competitive. Cross-lingual training on multiple non-target languages yields the strongest transfer for fine-tuned LLMs, while smaller encoder or seq-to-seq models benefit most from code-switching, highlighting architecture-specific strategies for multilingual ABSA. We further contribute two new German datasets, an adapted GERestaurant and the first German ASQP dataset (GERest), to encourage multilingual ABSA research beyond English.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a multilingual evaluation of ABSA methods across seven languages (English, German, French, Dutch, Russian, Spanish, Czech) and four subtasks (ACD, ACSA, TASD, ASQP). It compares transformer architectures (LLMs, encoders, seq-to-seq) under zero-resource, data-only, and full-resource settings using cross-lingual transfer, code-switching, and machine translation. The central claims are that fine-tuned LLMs achieve the highest scores, especially on complex generative tasks; that few-shot LLMs and smaller encoders are competitive on simpler tasks; and that cross-lingual training on multiple languages works best for LLMs while code-switching benefits smaller models. Two new German datasets (an adapted GERestaurant and GERest) are contributed to support the evaluation.
Significance. If the empirical comparisons hold after addressing data-quality reporting, the work would offer practical, architecture-specific guidance for multilingual ABSA transfer and add needed German resources to an English-dominated field. The systematic zero-to-full-resource framing and identification of strategy differences by model scale could inform broader multilingual NLP experiments.
major comments (2)
- [Dataset contribution] The headline architecture-specific strategy claims rest on direct numerical comparisons that include the two newly contributed German datasets (adapted GERestaurant, GERest). No inter-annotator agreement figures, annotation protocol, guidelines, or adjudication process are reported, so it is impossible to verify that label quality and aspect-boundary conventions are comparable to those of the other languages; any systematic difference would introduce artifacts into the reported performance orderings and transfer-method rankings.
- [Results and experimental sections] The abstract asserts concrete performance orderings (fine-tuned LLMs highest overall, code-switching optimal for smaller models) and the reader's summary references held-out empirical outcomes, yet the provided description references no tables, figures, error bars, or statistical significance tests. Without these, the strength of evidence for the cross-lingual vs. code-switching differential cannot be evaluated.
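The inter-annotator agreement requested in the first major comment can be quantified in several ways. The report does not say which statistic the authors would use, so the sketch below assumes Cohen's kappa for two annotators assigning one categorical label per item; the function name and the toy labels are illustrative only.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who each labelled the same items once."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy sentiment labels from two hypothetical annotators.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four items (0.75) against a chance level of 0.5, which yields a kappa of 0.5. For span-level decisions such as aspect boundaries, a different agreement measure may be more appropriate.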
minor comments (1)
- [Abstract] The summary of findings would be clearer if it briefly indicated the magnitude of the reported gains or the specific metrics used (e.g., F1 for each subtask).
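For readers unfamiliar with how F1 is typically computed for structured ABSA outputs, the following sketch shows an exact-match micro-F1 over predicted tuples. This is an assumption about the metric, since the review does not state the paper's scoring rules; the function name and the toy tuples are hypothetical.

```python
from typing import Sequence, Set, Tuple

TupleSet = Set[Tuple[str, ...]]


def micro_f1(gold: Sequence[TupleSet], pred: Sequence[TupleSet]) -> float:
    """Micro-averaged F1 where a predicted tuple counts only if it matches a gold tuple exactly."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same sentences")
    tp = fp = fn = 0
    for gold_set, pred_set in zip(gold, pred):
        tp += len(gold_set & pred_set)   # exact matches
        fp += len(pred_set - gold_set)   # predicted but not in gold
        fn += len(gold_set - pred_set)   # in gold but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Two toy sentences with (category, polarity) tuples, as in an ACSA-style task.
gold_tuples = [{("food", "positive"), ("service", "negative")}, {("price", "neutral")}]
pred_tuples = [{("food", "positive")}, {("price", "negative")}]
score = micro_f1(gold_tuples, pred_tuples)
```

In this toy case one tuple is correct, one is wrong, and two gold tuples are missed, giving precision 0.5, recall 1/3, and an F1 of 0.4. The same function applies to triplets or quadruples because tuples are compared as whole units.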
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving transparency around the new datasets and the presentation of experimental evidence. We address each point below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Dataset contribution] The headline architecture-specific strategy claims rest on direct numerical comparisons that include the two newly contributed German datasets (adapted GERestaurant, GERest). No inter-annotator agreement figures, annotation protocol, guidelines, or adjudication process are reported, so it is impossible to verify that label quality and aspect-boundary conventions are comparable to those of the other languages; any systematic difference would introduce artifacts into the reported performance orderings and transfer-method rankings.
  Authors: We agree that the annotation details for the contributed German datasets require fuller documentation to support the comparability claims. The datasets were created by adapting established English ABSA resources using standard annotation practices for aspect boundaries and sentiment labels. In the revised manuscript we will add a dedicated subsection describing the full annotation protocol, guidelines, adjudication process, and inter-annotator agreement statistics computed during creation. This will allow readers to assess label quality directly.
  Revision: yes
- Referee: [Results and experimental sections] The abstract asserts concrete performance orderings (fine-tuned LLMs highest overall, code-switching optimal for smaller models) and the reader's summary references held-out empirical outcomes, yet the provided description references no tables, figures, error bars, or statistical significance tests. Without these, the strength of evidence for the cross-lingual vs. code-switching differential cannot be evaluated.
  Authors: The full manuscript contains a dedicated results section with multiple tables and figures that report all performance metrics across languages, subtasks, and resource settings, directly supporting the abstract claims. To strengthen the presentation we will add explicit references to these tables and figures from the abstract and introduction, include error bars on relevant plots, and incorporate statistical significance tests (e.g., McNemar's test or paired t-tests) for the key architecture-specific comparisons. This will make the evidence for the cross-lingual versus code-switching differential fully evaluable.
  Revision: partial
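The rebuttal names McNemar's test as one candidate for the significance analysis. A minimal sketch of the exact (binomial) two-sided version is given below, assuming each system's output can be reduced to a per-example correct/incorrect flag, for instance sentence-level exact match. That reduction, the function name, and the toy data are assumptions, not something the manuscript is known to do.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for two systems scored on the same examples."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same examples")
    # Discordant pairs: exactly one of the two systems is correct.
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, hence no evidence of a difference
    k = min(only_a, only_b)
    # Binomial tail under the null hypothesis that discordant pairs split 50/50.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy comparison: system A is right on 10 examples that system B gets wrong;
# both are right on 5 more. Values are illustrative only.
system_a = [True] * 15
system_b = [False] * 10 + [True] * 5
p_value = mcnemar_exact(system_a, system_b)
```

With ten discordant pairs all favouring one system, the p-value is 2/1024, roughly 0.002. Only the discordant pairs influence the result, which is why the test suits paired comparisons on a shared test set.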
Circularity Check
No circularity: purely empirical evaluation on held-out data
The paper reports direct experimental outcomes from fine-tuning and evaluating transformer models (encoder, seq-to-seq, LLM) on ABSA subtasks across seven languages under zero-resource, data-only, and full-resource settings. Performance differences are measured on test sets, including the newly introduced German resources, with no equations, parameter fits, or derivations presented as predictions. No self-citations are invoked to justify uniqueness theorems or ansatzes; claims about architecture-specific transfer strategies (e.g., cross-lingual training for LLMs vs. code-switching for smaller models) are conclusions drawn from the observed numerical results rather than reductions to prior self-referential inputs. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard machine learning evaluation assumptions hold, including representative datasets and useful metrics.
Reference graph
Works this paper leans on
- [1] Introduction. Aspect-based Sentiment Analysis (ABSA) has become a central task for mining fine-grained opinions, aiming to detect sentiment toward specific aspects within text. Despite substantial methodological advances, from transfer learning-based classifiers (Cai et al., 2020; Cui et al., 2024) to instruction-tuned large language models (LLMs) (Scaria...
- [2] and Czech (Šmíd et al., 2024b) datasets following the same schema, and contribute the first German ASQP dataset, GERest, to enable cross-lingual ASQP evaluation. We systematically compare three modeling paradigms: (a) Encoder-only classification, which conceptualize ABSA as a supervised multi-label classification problem, including BERT-based archite...
- [3] In the zero-resource setting, neither annotated data nor language-specific models are available; models must rely solely on cross-lingual transfer capabilities.
- [4] In the data-only setting, annotated training data in the target language is available, but no dedicated language-specific model exists, requiring multilingual models to adapt to the language.
- [5] In the full-resource setting, both annotated data and language-specific pre-trained models are available, allowing us to assess the performance ceiling for each language. To enhance zero-resource settings, we apply code-switching and machine-translation augmentation (Zhang et al., 2021a) to generate pseudo-training data from English. Our study provide...
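The three resource settings described in the excerpts above differ along two axes: whether target-language annotations exist and whether a language-specific pre-trained model exists. The sketch below encodes that as a small data structure. The class, field, and function names are hypothetical, and the returned strings merely paraphrase the excerpts.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ResourceSetting:
    name: str
    has_target_annotations: bool
    has_language_specific_model: bool


ZERO_RESOURCE = ResourceSetting("zero-resource", False, False)
DATA_ONLY = ResourceSetting("data-only", True, False)
FULL_RESOURCE = ResourceSetting("full-resource", True, True)


def plan(setting: ResourceSetting) -> Tuple[str, str]:
    """Return the (training data source, model family) implied by a setting."""
    data = (
        "target-language annotations"
        if setting.has_target_annotations
        else "cross-lingual transfer or pseudo-training data from English"
    )
    model = (
        "language-specific model"
        if setting.has_language_specific_model
        else "multilingual model"
    )
    return data, model


zero_plan = plan(ZERO_RESOURCE)
full_plan = plan(FULL_RESOURCE)
```

Framing the settings this way makes it explicit that the data-only setting differs from the full-resource one only in the choice of base model, not in the available annotations.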
- [6] Related Work. In recent years, ABSA has seen substantial progress through both classification-based and generative modeling approaches. 2.1. State-of-the-Art Modeling Approaches for ABSA. Recent advances in ABSA span a continuum from supervised classification to generative and instruction-based approaches. For simpler subtasks such as ACD and ACSA, transform...
- [7] Methodology. 3.1. Tasks. ABSA comprises a number of subtasks that differ in the level of detail and the type of information they extract from the text. In this work, we focus on four common ABSA tasks with different levels of granularity that are supported by the structure and annotations of our multilingual datasets: Aspect Category Detection (ACD), Aspect...
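Because the excerpt is cut off before the remaining task definitions, the sketch below relies on how the four acronyms are commonly used in the ABSA literature: ACD outputs categories, ACSA outputs (category, polarity) pairs, TASD outputs (target, category, polarity) triplets, and ASQP outputs (aspect term, category, opinion term, polarity) quadruples. The category labels and the example annotations are hypothetical.

```python
from typing import List, NamedTuple


class Quad(NamedTuple):
    aspect_term: str
    category: str
    opinion_term: str
    polarity: str


def project(quads: List[Quad], task: str) -> list:
    """Reduce full quadruple annotations to the output format of a coarser subtask."""
    if task == "ACD":
        return sorted({q.category for q in quads})
    if task == "ACSA":
        return sorted({(q.category, q.polarity) for q in quads})
    if task == "TASD":
        return sorted({(q.aspect_term, q.category, q.polarity) for q in quads})
    if task == "ASQP":
        return sorted(set(quads))
    raise ValueError(f"unknown task: {task}")


# Hypothetical annotation for: "The pizza was delicious but the waiter was rude."
annotation = [
    Quad("pizza", "FOOD#QUALITY", "delicious", "positive"),
    Quad("waiter", "SERVICE#GENERAL", "rude", "negative"),
]
acd_output = project(annotation, "ACD")
tasd_output = project(annotation, "TASD")
```

The projection shows why the tasks form a granularity ladder: each coarser task can be derived from quadruple annotations by dropping fields, which is consistent with the excerpt's remark that the tasks are supported by the datasets' structure.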
- [8] and language-specific variants, such as ruBERT (Kuratov and Arkhipov, 2019) for Russian. 3.4.2. Seq-2-Seq Text Generation. DLO: Dynamic Label Ordering (Hu et al., 2022) reformulates ABSA as a generative task by dynamically augmenting and reordering output tuples (e.g., for TASD, ASQP), improving alignment between input and structured outputs. Models: Multiling...
- [9] Results. 4.1. Results for Monolingual Training. In the monolingual setting, where training and testing use the same language, we compare two configurations on the balanced datasets (see Table 2): Multi = a multilingual model (e.g., mT5-base) fine-tuned per language, and Spec = a language-specific model (e.g., ruT5-base for Russian) fine-tuned on the same data....
- [10] Conclusion & Future Work. This work presented a comprehensive multilingual evaluation of SOTA approaches for ABSA across seven languages and four subtasks. By comparing encoder-only, sequence-to-sequence, and decoder-only architectures under varying resource conditions, we analyzed how well current models generalize across languages and ABSA tasks. Our r...
- [11] Bibliographical References. Md Shad Akhtar, Asif Ekbal, and Pushpak Bhattacharyya. 2016. Aspect based sentiment analysis in Hindi: Resource creation and evaluation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 2703–2709, Portorož, Slovenia. European Language Resources Association (ELRA)....
- [12] Exploring distributional representations and machine translation for aspect-based cross-lingual sentiment classification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1613–1623. The COLING 2016 Organizing Committee.
  Footnote 9: Claude Sonnet: https://www.anthropic.com/claude/sonnet
  Hongjie C...
- [13] Chengyan Wu, Bolei Ma, Ningyuan Deng, Yanqing He, and Yun Xue
  Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. arXiv [cs.CL]. Chengyan Wu, Bolei Ma, Ningyuan Deng, Yanqing He, and Yun Xue. 2025a. Multi-scale and multi-objective optimization for cross-lingual aspect-based sentiment analysis. arXiv [cs.CL]. Chengyan Wu, Bolei Ma, Nin...