
arxiv: 2604.26619 · v1 · submitted 2026-04-29 · 💻 cs.CL

Zero-Shot to Full-Resource: Cross-lingual Transfer Strategies for Aspect-Based Sentiment Analysis

Pith reviewed 2026-05-07 11:08 UTC · model grok-4.3

classification 💻 cs.CL
keywords aspect-based sentiment analysis · cross-lingual transfer · multilingual NLP · large language models · zero-shot learning · code-switching · German datasets

The pith

Fine-tuned large language models achieve the best results in multilingual aspect-based sentiment analysis when using cross-lingual training on multiple languages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates state-of-the-art transformer models for aspect-based sentiment analysis across seven languages and four subtasks under zero-resource to full-resource conditions. It establishes that fine-tuned large language models deliver the highest scores overall, especially on complex generative tasks, and gain the most from cross-lingual training that draws on multiple non-target languages. Smaller encoder and sequence-to-sequence models remain competitive in simpler setups when code-switching is applied instead. The authors also release two new German datasets to support research that moves beyond English-centric work.

Core claim

Fine-tuned Large Language Models achieve the highest overall scores, particularly in complex generative tasks, while few-shot counterparts approach this performance in simpler setups, where smaller encoder models also remain competitive. Cross-lingual training on multiple non-target languages yields the strongest transfer for fine-tuned LLMs, while smaller encoder or seq-to-seq models benefit most from code-switching.

What carries the argument

Architecture-specific transfer strategies that combine cross-lingual training on multiple languages, code-switching, and machine translation, tested across model sizes from small encoders to large language models under varying resource levels.
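To make the code-switching idea concrete, here is a minimal sketch of dictionary-based token replacement. It is not the authors' pipeline: the function name `code_switch`, the `ratio` parameter, and the toy English-to-German lexicon are all hypothetical, and a real setup would likely use a much larger bilingual lexicon or a translation system.

```python
import random

def code_switch(tokens, lexicon, ratio=0.3, seed=0):
    """Replace a share of translatable tokens with target-language words.

    `lexicon` maps lower-cased source words to target-language words.
    `ratio` is the fraction of translatable tokens to switch (assumed knob).
    """
    rng = random.Random(seed)
    # Only tokens that have a lexicon entry can be switched.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in lexicon]
    k = round(len(candidates) * ratio)
    chosen = set(rng.sample(candidates, k))
    return [lexicon[tok.lower()] if i in chosen else tok
            for i, tok in enumerate(tokens)]

# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {
    "food": "Essen",
    "service": "Bedienung",
    "great": "großartig",
    "slow": "langsam",
}

sentence = "The food was great but the service was slow".split()
print(" ".join(code_switch(sentence, TOY_LEXICON, ratio=0.5)))
```

Because the labels of an ABSA example refer to aspect terms in the text, a practical version would also need to update any aspect-term annotations that point at replaced tokens; this sketch leaves that step out.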

If this is right

  • Fine-tuned LLMs should be selected first for complex generative ABSA subtasks in multilingual settings.
  • Training on multiple non-target languages maximizes transfer gains specifically for large models.
  • Code-switching offers the most benefit to smaller encoder and seq-to-seq models in low-data regimes.
  • The new German datasets enable direct comparison and further development of non-English ABSA systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same architecture-dependent transfer rules may guide choices in other fine-grained multilingual NLP tasks such as opinion mining or entity linking.
  • For very low-resource languages, the optimal path could be to first collect modest amounts of multi-language data rather than attempting full annotation in the target language alone.
  • Task complexity should influence model-size selection more than raw data volume when designing cross-lingual ABSA pipelines.

Load-bearing premise

That the seven languages and four subtasks are representative of broader multilingual ABSA challenges and that the new German datasets carry no annotation bias that would distort model comparisons.
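One common way to probe annotation consistency is an inter-annotator agreement statistic such as Cohen's kappa. The sketch below is a generic implementation for two annotators and categorical labels; it is not taken from the paper, and the example labels are invented for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)

# Invented sentiment labels from two hypothetical annotators.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohen_kappa(annotator_1, annotator_2))
```

Kappa only covers categorical decisions such as sentiment polarity or aspect category; agreement on aspect-term boundaries would need a span-level measure instead.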

What would settle it

A reversal in performance rankings between fine-tuned LLMs and smaller models using code-switching when tested on a different set of languages or subtasks would indicate the architecture-specific strategies do not hold more generally.

Figures

Figures reproduced from arXiv: 2604.26619 by Christian Wolff, Jakob Fehle, Nils Constantin Hellwig, Udo Kruschwitz.

Figure 1: The diagram illustrates both the absolute dataset sizes and aspect category distributions across
Figure 2: Prompt example for the ACD task for the English-language SemEval 2016 restaurant dataset.
Figure 3: Prompt example for the ACSA task for the English-language SemEval 2016 restaurant dataset.
Figure 4: Prompt example for the TASD task for the English-language SemEval 2016 restaurant dataset.

Original abstract

Aspect-based Sentiment Analysis (ABSA) extracts fine-grained opinions toward specific aspects within text but remains largely English-focused despite major advances in transformer-based and instruction-tuned models. This work presents a multilingual evaluation of state-of-the-art ABSA approaches across seven languages (English, German, French, Dutch, Russian, Spanish, and Czech) and four subtasks (ACD, ACSA, TASD, ASQP). We systematically compare different transformer architectures under zero-resource, data-only, and full-resource settings, using cross-lingual transfer, code-switching and machine translation. Fine-tuned Large Language Models (LLMs) achieve the highest overall scores, particularly in complex generative tasks, while few-shot counterparts approach this performance in simpler setups, where smaller encoder models also remain competitive. Cross-lingual training on multiple non-target languages yields the strongest transfer for fine-tuned LLMs, while smaller encoder or seq-to-seq models benefit most from code-switching, highlighting architecture-specific strategies for multilingual ABSA. We further contribute two new German datasets, an adapted GERestaurant and the first German ASQP dataset (GERest), to encourage multilingual ABSA research beyond English.
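The phrase "cross-lingual training on multiple non-target languages" can be read as pooling the training sets of every language except the one under test. The sketch below illustrates that reading only; the function `build_training_pool`, the `include_target` flag, and the placeholder data are assumptions, not the authors' code.

```python
# ISO 639-1 codes for the seven languages named in the abstract.
LANGUAGES = ["en", "de", "fr", "nl", "ru", "es", "cs"]

def build_training_pool(datasets, target, include_target=False):
    """Pool training examples across languages.

    With include_target=False, no annotated target-language data is used,
    which corresponds to training purely on non-target languages.
    """
    if target not in datasets:
        raise KeyError(f"unknown target language: {target}")
    pool = []
    for lang, examples in datasets.items():
        if lang == target and not include_target:
            continue  # hold out the target language entirely
        pool.extend((lang, example) for example in examples)
    return pool

# Placeholder data: one dummy example per language.
toy_datasets = {lang: [f"{lang}-example"] for lang in LANGUAGES}
pool = build_training_pool(toy_datasets, target="de")
print(sorted({lang for lang, _ in pool}))
```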

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a multilingual evaluation of ABSA methods across seven languages (English, German, French, Dutch, Russian, Spanish, Czech) and four subtasks (ACD, ACSA, TASD, ASQP). It compares transformer architectures (LLMs, encoders, seq-to-seq) under zero-resource, data-only, and full-resource settings using cross-lingual transfer, code-switching, and machine translation. The central claims are that fine-tuned LLMs achieve the highest scores, especially on complex generative tasks; that few-shot LLMs and smaller encoders are competitive on simpler tasks; that cross-lingual training on multiple languages works best for LLMs while code-switching benefits smaller models; and that two new German datasets (adapted GERestaurant and GERest) are contributed to support the evaluation.

Significance. If the empirical comparisons hold after addressing data-quality reporting, the work would offer practical, architecture-specific guidance for multilingual ABSA transfer and add needed German resources to an English-dominated field. The systematic zero-to-full-resource framing and identification of strategy differences by model scale could inform broader multilingual NLP experiments.

major comments (2)
  1. [Dataset contribution] Dataset contribution section: The headline architecture-specific strategy claims rest on direct numerical comparisons that include the two newly contributed German datasets (GERestaurant, GERest). No inter-annotator agreement figures, annotation protocol, guidelines, or adjudication process are reported, so it is impossible to verify that label quality and aspect-boundary conventions are comparable to the other languages; any systematic differences would artifactually affect the reported performance orderings and transfer-method rankings.
  2. [Results and experimental sections] Results and experimental sections: The abstract asserts concrete performance orderings (fine-tuned LLMs highest overall, code-switching optimal for smaller models) and the reader's summary references held-out empirical outcomes, yet no tables, figures, error bars, or statistical significance tests are referenced in the provided description. Without these, the strength of evidence for the cross-lingual vs. code-switching differential cannot be evaluated.
minor comments (1)
  1. [Abstract] Abstract: The summary of findings would be clearer if it briefly indicated the magnitude of the reported gains or the specific metrics used (e.g., F1 for each subtask).
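For readers unfamiliar with how F1 is typically computed for tuple-extraction subtasks such as TASD or ASQP, a common convention is exact-match micro-F1 over predicted tuples. The sketch below shows that convention; whether the paper uses exactly this metric is not stated in the material above, and the example tuples are invented.

```python
def micro_f1(gold, pred):
    """Exact-match micro-F1 over per-sentence collections of tuples."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same sentences")
    tp = fp = fn = 0
    for gold_tuples, pred_tuples in zip(gold, pred):
        g, p = set(gold_tuples), set(pred_tuples)
        tp += len(g & p)   # tuples predicted exactly right
        fp += len(p - g)   # predicted tuples not in the gold set
        fn += len(g - p)   # gold tuples that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented (aspect term, aspect category, polarity) triplets for two sentences.
gold = [
    [("food", "FOOD#QUALITY", "positive")],
    [("service", "SERVICE#GENERAL", "negative"),
     ("NULL", "RESTAURANT#GENERAL", "positive")],
]
pred = [
    [("food", "FOOD#QUALITY", "positive")],
    [("service", "SERVICE#GENERAL", "positive")],
]
print(micro_f1(gold, pred))
```

Under exact matching, a tuple with the right aspect and category but the wrong polarity counts as both a false positive and a false negative, which is why scores tend to drop as tuples get longer.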

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving transparency around the new datasets and the presentation of experimental evidence. We address each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Dataset contribution] Dataset contribution section: The headline architecture-specific strategy claims rest on direct numerical comparisons that include the two newly contributed German datasets (GERestaurant, GERest). No inter-annotator agreement figures, annotation protocol, guidelines, or adjudication process are reported, so it is impossible to verify that label quality and aspect-boundary conventions are comparable to the other languages; any systematic differences would artifactually affect the reported performance orderings and transfer-method rankings.

    Authors: We agree that the annotation details for the contributed German datasets require fuller documentation to support the comparability claims. The datasets were created by adapting established English ABSA resources using standard annotation practices for aspect boundaries and sentiment labels. In the revised manuscript we will add a dedicated subsection describing the full annotation protocol, guidelines, adjudication process, and inter-annotator agreement statistics computed during creation. This will allow readers to assess label quality directly. revision: yes

  2. Referee: [Results and experimental sections] Results and experimental sections: The abstract asserts concrete performance orderings (fine-tuned LLMs highest overall, code-switching optimal for smaller models) and the reader's summary references held-out empirical outcomes, yet no tables, figures, error bars, or statistical significance tests are referenced in the provided description. Without these, the strength of evidence for the cross-lingual vs. code-switching differential cannot be evaluated.

    Authors: The full manuscript contains a dedicated results section with multiple tables and figures that report all performance metrics across languages, subtasks, and resource settings, directly supporting the abstract claims. To strengthen the presentation we will add explicit references to these tables and figures from the abstract and introduction, include error bars on relevant plots, and incorporate statistical significance tests (e.g., McNemar's test or paired t-tests) for the key architecture-specific comparisons. This will make the evidence for the cross-lingual versus code-switching differential fully evaluable. revision: partial
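As an illustration of the kind of significance test mentioned in the rebuttal, the sketch below implements an exact two-sided McNemar test for two systems evaluated on the same test items. It assumes per-item correctness flags are available (for example, sentence-level exact match); the flags shown are invented and do not come from the paper.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired correctness flags."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same items")
    # Discordant pairs: items where exactly one system is correct.
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree
    k = min(only_a, only_b)
    # Binomial tail under H0: each discordant pair favours either system with p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Invented flags: system A alone is right on 8 items, system B alone on 1.
system_a = [True] * 8 + [False] * 1 + [True] * 5
system_b = [False] * 8 + [True] * 1 + [True] * 5
print(mcnemar_exact(system_a, system_b))
```

The test only looks at items where the two systems disagree, so it suits paired comparisons on a shared test set; comparing aggregate F1 across runs or seeds would call for a different procedure, such as a paired t-test or bootstrap resampling.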

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on held-out data

Full rationale

The paper reports direct experimental outcomes from fine-tuning and evaluating transformer models (encoder, seq-to-seq, LLM) on ABSA subtasks across seven languages under zero-resource, data-only, and full-resource regimes. Performance differences are measured on test sets, including the newly introduced German resources, with no equations, parameter fits, or derivations presented as predictions. No self-citations are invoked to justify uniqueness theorems or ansatzes; claims about architecture-specific transfer strategies (e.g., cross-lingual training for LLMs vs. code-switching for smaller models) are conclusions drawn from the observed numerical results rather than reductions to prior self-referential inputs. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Claims rest on standard empirical NLP assumptions rather than new derivations or entities.

axioms (1)
  • Domain assumption: Standard machine learning evaluation assumptions hold, including representative datasets and useful metrics.
    Implicit when claiming one transfer strategy is strongest.


