Cross-lingual Relation Extraction with Large Language Models: Zero-Shot, Few-Shot, and Fine-Tuned Evaluation on Romanian
Pith reviewed 2026-07-01 05:19 UTC · model grok-4.3
The pith
Fine-tuned encoders 50-250 times smaller come within 1-4 percentage points of a QLoRA fine-tuned 31B LLM on Romanian relation extraction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
QLoRA fine-tuning improves macro F1-Score by more than 22 percentage points in both languages while reducing the cross-lingual gap from 3.3 to 1.4pp. The encoder baselines come within 1-4pp of QLoRA Gemma on Romanian despite being 50-250 times smaller, with monolingual Romanian BERT at 125M parameters matching multilingual XLM-R at 278M. The case for using a 31B model for single-task RE on Romanian is therefore weak in deployment scenarios where compute matters.
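The "50-250 times smaller" figure can be checked with simple arithmetic. The sketch below uses only the parameter counts quoted in the claim (31B, 125M, 278M, 560M); the dictionary keys and the function name are illustrative, and the 31B figure is treated as a nominal count.

```python
# Parameter counts as quoted in the abstract; 31B is taken as nominal.
LLM_PARAMS = 31e9
ENCODER_PARAMS = {
    "Romanian BERT (125M)": 125e6,
    "XLM-R (278M)": 278e6,
    "largest baseline (560M)": 560e6,
}


def size_ratio(llm_params: float, encoder_params: float) -> float:
    """Return how many times larger the LLM is than the encoder."""
    return llm_params / encoder_params


ratios = {name: size_ratio(LLM_PARAMS, n) for name, n in ENCODER_PARAMS.items()}
for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f}x smaller than the LLM")
```

The ratios span roughly 55x to 248x, which is consistent with the rounded "50-250 times" range.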
What carries the argument
The LLM-based automatic translation of the SemEval-2010 Task 8 dataset into Romanian, followed by a direct performance comparison of QLoRA fine-tuned Gemma 4 31B against encoder baselines in both the relation classification and end-to-end extraction settings.
If this is right
- QLoRA fine-tuning narrows the English-Romanian gap on this task from 3.3 to 1.4pp.
- Few-shot prompting yields only small gains over zero-shot prompting for relation extraction.
- Monolingual Romanian BERT (125M parameters) matches the multilingual XLM-R at 278M on the Romanian data.
- For single-task relation extraction, the extra scale of a 31B model does not justify its cost once fine-tuning is used.
Where Pith is reading between the lines
- The same translation-plus-fine-tuning approach could be tested on other low-resource languages that already have small monolingual encoders.
- If similar patterns hold, practitioners could default to fine-tuned small models for many non-English information extraction tasks rather than scaling to large LLMs.
- Energy and latency budgets in production systems would favor the smaller encoders shown here.
Load-bearing premise
The automatic translation pipeline produces a Romanian dataset that keeps the original relation labels and entity annotations accurate enough for the performance numbers to be trustworthy.
What would settle it
A human review of a random sample of the translated Romanian sentences that finds frequent label flips or broken entity spans would make the reported cross-lingual gaps and model comparisons unreliable.
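An inexpensive precursor to that human review is an automated structural check. The sketch below assumes the translated sentences keep the inline <e1>...</e1> and <e2>...</e2> entity markers of the original SemEval-2010 Task 8 format. The function names and the two Romanian sentences are illustrative and are not taken from the paper's released code. It flags broken entity spans only; it cannot detect a relation label that no longer fits the translated sentence.

```python
import re

# One pattern per entity marker; the span must contain at least one character.
PATTERNS = {
    "e1": re.compile(r"<e1>(.+?)</e1>", re.DOTALL),
    "e2": re.compile(r"<e2>(.+?)</e2>", re.DOTALL),
}


def markers_intact(sentence: str) -> bool:
    """True if the sentence has exactly one non-empty <e1> and <e2> span."""
    for tag, pattern in PATTERNS.items():
        if sentence.count(f"<{tag}>") != 1 or sentence.count(f"</{tag}>") != 1:
            return False
        match = pattern.search(sentence)
        if match is None or not match.group(1).strip():
            return False
    return True


def mismatch_rate(translated: list[str]) -> float:
    """Fraction of translated sentences whose entity markers are broken."""
    if not translated:
        return 0.0
    broken = sum(not markers_intact(s) for s in translated)
    return broken / len(translated)


# Hypothetical examples: the second sentence lost its <e1> span in translation.
examples = [
    "<e1>Copilul</e1> a fost așezat cu grijă în <e2>leagăn</e2>.",
    "Copilul a fost așezat cu grijă în <e2>leagăn</e2>.",
]
print(mismatch_rate(examples))
```

A high mismatch rate would already cast doubt on the reported numbers; a low one still leaves label fidelity to be judged by human readers.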
Original abstract
Relation extraction (RE) for low-resource languages is typically constrained by the lack of annotated corpora. We investigate the feasibility of cross-lingual RE for Romanian by combining automatic dataset translation with large language model (LLM) inference. We translate the SemEval-2010 Task 8 benchmark from English to Romanian using an LLM-based translation pipeline and evaluate Gemma 4 31B under zero-shot, few-shot, and QLoRA fine-tuned configurations, against four encoder baselines spanning 125M to 560M parameters: XLM-RoBERTa (base and large), Romanian BERT, and RoBERT-large. We assess two task formulations: relation classification with marked entities and end-to-end extraction. Our results show that Romanian incurs a 3 to 5 percentage point (pp) drop relative to English in prompt-only settings, that few-shot prompting provides marginal gains over zero-shot, and that QLoRA fine-tuning improves macro F1-Score by more than 22 percentage points in both languages while reducing the cross-lingual gap from 3.3 to 1.4pp. The encoder baselines come within 1-4pp of QLoRA Gemma on Romanian despite being 50-250 times smaller, with monolingual Romanian BERT at 125M parameters matching multilingual XLM-R at 278M. The case for using a 31B model for single-task RE on Romanian is therefore weak in deployment scenarios where compute matters. We release the translated dataset, evaluation code, and trained models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates cross-lingual relation extraction on Romanian by automatically translating the English SemEval-2010 Task 8 dataset via an LLM pipeline, then benchmarking Gemma 4 31B under zero-shot, few-shot, and QLoRA fine-tuning against encoder baselines (XLM-RoBERTa base/large, Romanian BERT, RoBERT-large). It reports a 3-5pp English-to-Romanian drop in prompt-only settings, marginal few-shot gains, >22pp macro-F1 improvement from QLoRA in both languages (reducing the gap from 3.3 to 1.4pp), and competitive performance from the much smaller encoders (within 1-4pp of the fine-tuned LLM).
Significance. If the translated Romanian data preserves original labels and entities with high fidelity, the findings would demonstrate that parameter-efficient fine-tuning can substantially close cross-lingual gaps in RE while showing that 125-560M encoder models remain practical alternatives to 31B LLMs for single-task deployment in low-resource settings.
major comments (1)
- [Abstract / §3] Abstract and §3 (dataset construction): All headline results (22+pp QLoRA gains, gap reduction to 1.4pp, encoder parity within 1-4pp) rest on the LLM-translated Romanian version of SemEval-2010 Task 8 being a faithful proxy. No human validation, entity-span mismatch rates, relation-label drift statistics, or even spot-check accuracy is reported for the translation pipeline used in both training and test data. This is load-bearing for the cross-lingual claims.
minor comments (2)
- [Abstract] The abstract refers to 'Gemma 4 31B' without clarifying the exact model variant or providing a citation; this should be standardized.
- [Results] Task formulations (relation classification with marked entities vs. end-to-end extraction) are mentioned but not clearly separated in the reported metrics; a table breaking down results by formulation would improve clarity.
Simulated Author's Rebuttal
We thank the referee for highlighting the critical importance of translation fidelity for our cross-lingual claims. We agree this is a substantive point and address it directly below.
Point-by-point responses
- Referee: [Abstract / §3] Abstract and §3 (dataset construction): All headline results (22+pp QLoRA gains, gap reduction to 1.4pp, encoder parity within 1-4pp) rest on the LLM-translated Romanian version of SemEval-2010 Task 8 being a faithful proxy. No human validation, entity-span mismatch rates, relation-label drift statistics, or even spot-check accuracy is reported for the translation pipeline used in both training and test data. This is load-bearing for the cross-lingual claims.
  Authors: We agree that the absence of reported validation metrics for the LLM translation pipeline is a limitation, as the quality of the Romanian data directly supports all headline cross-lingual results. The current manuscript describes the pipeline in §3 but provides no human evaluation, entity-span accuracy, label drift statistics, or spot-checks. In the revised version we will add a dedicated validation subsection reporting human assessment on a stratified sample of 200 translated instances (covering both train and test splits), including entity-span preservation rate, relation-label consistency, and overall fidelity scores. This will be presented alongside the existing pipeline description.
  Revision: yes
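The rebuttal does not say how the 200-instance sample would be stratified. A minimal sketch, assuming strata are defined by (split, relation label) and filled in proportion to their size, could look as follows. The field names "split" and "label", the function name, and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(instances, key, n, seed=0):
    """Draw about n instances, allocating to each stratum by its share."""
    rng = random.Random(seed)  # fixed seed so the review sample is reproducible
    strata = defaultdict(list)
    for inst in instances:
        strata[key(inst)].append(inst)
    total = len(instances)
    sample = []
    for stratum in sorted(strata):
        members = strata[stratum]
        # At least one item per stratum so rare relations are still reviewed.
        quota = max(1, round(n * len(members) / total))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Toy data: 100 instances over two splits and two placeholder labels.
counts = {("train", "A"): 60, ("train", "B"): 20, ("test", "A"): 15, ("test", "B"): 5}
toy = [
    {"id": f"{split}-{label}-{i}", "split": split, "label": label}
    for (split, label), c in counts.items()
    for i in range(c)
]
picked = stratified_sample(toy, key=lambda d: (d["split"], d["label"]), n=20)
print(len(picked))
```

Because each quota is rounded and floored at one, the returned sample can differ slightly from n when strata are small or numerous; on the toy data the quotas are exact.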
Circularity Check
No circularity: purely empirical benchmarking with no derivations or self-referential reductions
The manuscript is an empirical evaluation study that translates an existing English benchmark via LLM pipeline, runs zero/few-shot and QLoRA experiments on Gemma, and compares against encoder baselines. All reported metrics (macro F1 deltas, cross-lingual gaps, model size comparisons) are direct experimental outcomes on held-out test data. No equations, fitted parameters, uniqueness theorems, or ansatzes are defined inside the work and then re-used as predictions. No self-citations serve as load-bearing justification for core claims. The translation fidelity concern is a validity issue for the experimental setup, not a circularity in any derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard evaluation metrics such as macro F1 are appropriate and sufficient for comparing relation extraction systems.
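For reference, macro F1 is the unweighted mean of per-class F1 scores, so rare relations count as much as frequent ones. The sketch below is a plain implementation of that definition and does not reproduce the official SemEval-2010 Task 8 scorer. Whether the paper excludes the Other class or scores directionality is not stated in this summary, so the optional labels argument is an assumption that lets either choice be made explicitly. The toy labels are illustrative.

```python
def macro_f1(gold, pred, labels=None):
    """Unweighted mean of per-label F1 over the given labels."""
    if labels is None:
        labels = sorted(set(gold))  # default: every label seen in the gold data
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


gold = ["Cause-Effect", "Cause-Effect", "Other", "Component-Whole"]
pred = ["Cause-Effect", "Other", "Other", "Component-Whole"]

all_classes = macro_f1(gold, pred)
without_other = macro_f1(gold, pred, labels=["Cause-Effect", "Component-Whole"])
print(f"{all_classes:.4f} {without_other:.4f}")
```

On this toy input the score is 7/9 over all three classes and 5/6 when Other is left out, which shows how much the treatment of Other can move the headline number.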