arxiv: 2606.31718 · v1 · pith:I4WTHBBLnew · submitted 2026-06-30 · 💻 cs.CL · cs.AI

Cross-lingual Relation Extraction with Large Language Models: Zero-Shot, Few-Shot, and Fine-Tuned Evaluation on Romanian

Pith reviewed 2026-07-01 05:19 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords relation extraction · cross-lingual · Romanian · large language models · QLoRA · zero-shot · few-shot · encoder models

The pith

Fine-tuned encoders 50-250 times smaller come within 1-4 points of a QLoRA fine-tuned 31B LLM on Romanian relation extraction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether a 31-billion-parameter LLM is required for relation extraction in Romanian by translating an English benchmark dataset and comparing Gemma under zero-shot, few-shot, and QLoRA fine-tuned conditions to much smaller encoder models. It finds that prompt-only performance is 3-5 percentage points lower for Romanian than for English, that few-shot prompting adds little over zero-shot, and that QLoRA fine-tuning raises macro F1 by more than 22 percentage points while shrinking the language gap from 3.3 to 1.4 points. The smaller encoders, some monolingual and 50-250 times smaller, stay within 1-4 points of the fine-tuned large model. This matters because it indicates that compute-heavy LLMs may not be needed for this specific task once fine-tuning is allowed.

Core claim

QLoRA fine-tuning improves macro F1-Score by more than 22 percentage points in both languages while reducing the cross-lingual gap from 3.3 to 1.4pp. The encoder baselines come within 1-4pp of QLoRA Gemma on Romanian despite being 50-250 times smaller, with monolingual Romanian BERT at 125M parameters matching multilingual XLM-R at 278M. The case for using a 31B model for single-task RE on Romanian is therefore weak in deployment scenarios where compute matters.
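
The "50-250 times smaller" figure can be sanity-checked from the parameter counts quoted in the abstract. The sketch below is only that arithmetic; the dictionary labels and the `size_ratio` helper are illustrative names, not anything taken from the paper's code.

```python
# Parameter counts quoted in the abstract; the dictionary keys are illustrative labels.
LLM_PARAMS = 31_000_000_000
ENCODER_PARAMS = {
    "Romanian BERT (125M)": 125_000_000,
    "XLM-R (278M)": 278_000_000,
    "largest encoder baseline (560M)": 560_000_000,
}


def size_ratio(llm_params: int, encoder_params: int) -> float:
    """Return how many times larger the LLM is than the encoder."""
    return llm_params / encoder_params


ratios = {name: size_ratio(LLM_PARAMS, n) for name, n in ENCODER_PARAMS.items()}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f}x smaller than the 31B model")
```

The ratios come out at roughly 248, 112, and 55, which is consistent with the stated 50-250 range.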

What carries the argument

The LLM-based automatic translation of the SemEval-2010 Task 8 dataset to Romanian, followed by direct performance comparison of QLoRA fine-tuned Gemma 4 31B against encoder baselines in both relation classification and end-to-end extraction settings.

If this is right

  • QLoRA fine-tuning closes most of the performance difference between English and Romanian for this task.
  • Few-shot prompting yields only small gains over zero-shot prompting for relation extraction.
  • Monolingual Romanian BERT performs as well as larger multilingual encoders on the Romanian data.
  • For single-task relation extraction, the extra scale of a 31B model does not justify its cost once fine-tuning is used.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same translation-plus-fine-tuning approach could be tested on other low-resource languages that already have small monolingual encoders.
  • If similar patterns hold, practitioners could default to fine-tuned small models for many non-English information extraction tasks rather than scaling to large LLMs.
  • Energy and latency budgets in production systems would favor the smaller encoders shown here.

Load-bearing premise

The automatic translation pipeline produces a Romanian dataset that keeps the original relation labels and entity annotations accurate enough for the performance numbers to be trustworthy.

What would settle it

A human review of a random sample of the translated Romanian sentences that finds frequent label flips or broken entity spans would make the reported cross-lingual gaps and model comparisons unreliable.
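
Part of such a review can be automated before any human looks at the data. The sketch below flags translated sentences whose entity markup did not survive translation. It assumes the Romanian release keeps the `<e1>...</e1>` and `<e2>...</e2>` inline tags of the original SemEval-2010 Task 8 files, which the paper summary does not confirm. The function names and example sentences are hypothetical.

```python
import re

# Assumption: translated sentences keep the SemEval-style inline entity tags.
E1 = re.compile(r"<e1>(.+?)</e1>")
E2 = re.compile(r"<e2>(.+?)</e2>")


def spans_intact(sentence: str) -> bool:
    """True if the sentence has exactly one non-empty e1 span and one non-empty e2 span."""
    e1 = E1.findall(sentence)
    e2 = E2.findall(sentence)
    return len(e1) == 1 and len(e2) == 1 and all(s.strip() for s in e1 + e2)


def broken_span_rate(sentences: list[str]) -> float:
    """Fraction of sentences whose entity markup is missing, duplicated, or empty."""
    if not sentences:
        return 0.0
    broken = sum(not spans_intact(s) for s in sentences)
    return broken / len(sentences)


# Hypothetical examples: the second sentence lost its e1 tags in translation.
examples = [
    "<e1>Copilul</e1> a fost pus în <e2>leagăn</e2>.",
    "Copilul a fost pus în <e2>leagăn</e2>.",
]
print(broken_span_rate(examples))
```

A check like this only catches structural damage. Label flips and spans that wrap the wrong word still need a human reader.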

Original abstract

Relation extraction (RE) for low-resource languages is typically constrained by the lack of annotated corpora. We investigate the feasibility of cross-lingual RE for Romanian by combining automatic dataset translation with large language model (LLM) inference. We translate the SemEval-2010 Task 8 benchmark from English to Romanian using an LLM-based translation pipeline and evaluate Gemma 4 31B under zero-shot, few-shot, and QLoRA fine-tuned configurations, against four encoder baselines spanning 125M to 560M parameters: XLM-RoBERTa (base and large), Romanian BERT, and RoBERT-large. We assess two task formulations: relation classification with marked entities and end-to-end extraction. Our results show that Romanian incurs a 3 to 5 percentage point (pp) drop relative to English in prompt-only settings, that few-shot prompting provides marginal gains over zero-shot, and that QLoRA fine-tuning improves macro F1-Score by more than 22 percentage points in both languages while reducing the cross-lingual gap from 3.3 to 1.4pp. The encoder baselines come within 1-4pp of QLoRA Gemma on Romanian despite being 50-250 times smaller, with monolingual Romanian BERT at 125M parameters matching multilingual XLM-R at 278M. The case for using a 31B model for single-task RE on Romanian is therefore weak in deployment scenarios where compute matters. We release the translated dataset, evaluation code, and trained models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper evaluates cross-lingual relation extraction on Romanian by automatically translating the English SemEval-2010 Task 8 dataset via an LLM pipeline, then benchmarking Gemma 4 31B under zero-shot, few-shot, and QLoRA fine-tuning against encoder baselines (XLM-RoBERTa base/large, Romanian BERT, RoBERT-large). It reports a 3-5pp English-to-Romanian drop in prompt-only settings, marginal few-shot gains, >22pp macro-F1 improvement from QLoRA in both languages (reducing the gap from 3.3 to 1.4pp), and competitive performance from the much smaller encoders (within 1-4pp of the fine-tuned LLM).

Significance. If the translated Romanian data preserves original labels and entities with high fidelity, the findings would demonstrate that parameter-efficient fine-tuning can substantially close cross-lingual gaps in RE while showing that 125-560M encoder models remain practical alternatives to 31B LLMs for single-task deployment in low-resource settings.

major comments (1)
  1. [Abstract / §3] Abstract and §3 (dataset construction): All headline results (22+pp QLoRA gains, gap reduction to 1.4pp, encoder parity within 1-4pp) rest on the LLM-translated Romanian version of SemEval-2010 Task 8 being a faithful proxy. No human validation, entity-span mismatch rates, relation-label drift statistics, or even spot-check accuracy is reported for the translation pipeline used in both training and test data. This is load-bearing for the cross-lingual claims.
minor comments (2)
  1. [Abstract] The abstract refers to 'Gemma 4 31B' without clarifying the exact model variant or providing a citation; this should be standardized.
  2. [Results] Task formulations (relation classification with marked entities vs. end-to-end extraction) are mentioned but not clearly separated in the reported metrics; a table breaking down results by formulation would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the critical importance of translation fidelity for our cross-lingual claims. We agree this is a substantive point and address it directly below.

Point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (dataset construction): All headline results (22+pp QLoRA gains, gap reduction to 1.4pp, encoder parity within 1-4pp) rest on the LLM-translated Romanian version of SemEval-2010 Task 8 being a faithful proxy. No human validation, entity-span mismatch rates, relation-label drift statistics, or even spot-check accuracy is reported for the translation pipeline used in both training and test data. This is load-bearing for the cross-lingual claims.

    Authors: We agree that the absence of reported validation metrics for the LLM translation pipeline is a limitation, as the quality of the Romanian data directly supports all headline cross-lingual results. The current manuscript describes the pipeline in §3 but provides no human evaluation, entity-span accuracy, label drift statistics, or spot-checks. In the revised version we will add a dedicated validation subsection reporting human assessment on a stratified sample of 200 translated instances (covering both train and test splits), including entity-span preservation rate, relation-label consistency, and overall fidelity scores. This will be presented alongside the existing pipeline description. revision: yes
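
One way the promised stratified sample could be drawn is sketched below. This is not the authors' procedure: the strata (split and relation label), the proportional allocation with largest-remainder rounding, the `stratified_sample` helper, and the toy data are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, n, seed=0):
    """Draw n items, allocating quotas to strata in proportion to stratum size."""
    if n > len(items):
        raise ValueError("sample size exceeds population size")
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    # Proportional quotas, rounded down, then top up by largest remainder
    # so that the quotas sum to exactly n.
    exact = {k: n * len(group) / len(items) for k, group in strata.items()}
    quota = {k: int(x) for k, x in exact.items()}
    leftover = n - sum(quota.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - quota[k], reverse=True)
    for k in by_remainder[:leftover]:
        quota[k] += 1

    sample = []
    for k, group in strata.items():
        sample.extend(rng.sample(group, quota[k]))
    return sample


# Toy population: 800 "train" and 200 "test" instances over three made-up labels.
labels = ["Cause-Effect", "Component-Whole", "Other"]
population = [
    {"id": i, "split": "train" if i < 800 else "test", "label": labels[i % 3]}
    for i in range(1000)
]
sample = stratified_sample(population, key=lambda x: (x["split"], x["label"]), n=200)
print(len(sample))
```

Fixing the seed makes the sample reproducible, so that a second annotator can be given exactly the same 200 instances.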

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with no derivations or self-referential reductions

Full rationale

The manuscript is an empirical evaluation study that translates an existing English benchmark via LLM pipeline, runs zero/few-shot and QLoRA experiments on Gemma, and compares against encoder baselines. All reported metrics (macro F1 deltas, cross-lingual gaps, model size comparisons) are direct experimental outcomes on held-out test data. No equations, fitted parameters, uniqueness theorems, or ansatzes are defined inside the work and then re-used as predictions. No self-citations serve as load-bearing justification for core claims. The translation fidelity concern is a validity issue for the experimental setup, not a circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical evaluation paper; relies on standard NLP assumptions about metric validity and translation fidelity but introduces no new mathematical axioms, free parameters, or postulated entities.

axioms (1)
  • [standard math] Standard evaluation metrics such as macro F1 are appropriate and sufficient for comparing relation extraction systems.
    All reported scores rest on this conventional assumption.
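
For readers who want the metric spelled out, here is a minimal sketch of macro F1 and of a cross-lingual gap in percentage points. It is a plain unweighted average of per-label F1. The official SemEval-2010 Task 8 scorer additionally excludes the Other class and accounts for relation direction, and whether the paper follows that convention is not stated in the summary above. The function names and toy labels are illustrative.

```python
def macro_f1(gold, pred, labels=None):
    """Unweighted mean of per-label F1; pass `labels` to restrict the label set."""
    if labels is None:
        labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


def gap_pp(f1_english, f1_romanian):
    """Cross-lingual gap in percentage points (English minus Romanian)."""
    return 100 * (f1_english - f1_romanian)


# Toy labels, not data from the paper.
gold = ["A", "A", "B", "B"]
pred = ["A", "B", "B", "B"]
print(round(macro_f1(gold, pred), 4))
```

Because every label counts equally, a rare relation that translation damages disproportionately can move macro F1 more than its frequency would suggest.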

pith-pipeline@v0.9.1-grok · 5839 in / 1265 out tokens · 38034 ms · 2026-07-01T05:19:27.720554+00:00 · methodology
