Just Use XML: Revisiting Joint Translation and Label Projection

Chris Biemann; Hans Ole Hatzel; Thennal DK

arXiv: 2603.12021 · v2 · submitted 2026-03-12 · cs.CL · cs.AI

Pith reviewed 2026-05-15 11:41 UTC · model grok-4.3

keywords: label projection · machine translation · cross-lingual transfer · XML tags · named entity recognition · joint modeling · low-resource languages

The pith

LabelPigeon jointly translates text and projects labels using XML tags, improving both translation quality and downstream cross-lingual performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that label projection, used to transfer annotations like named entities from high-resource to low-resource languages, does not need to be done as a separate step after machine translation. Instead, LabelPigeon integrates the two by marking label spans with XML tags in the input, allowing a translation model to learn both tasks together. This matters for building NLP tools in many languages where creating labeled data from scratch is expensive. Experiments show it not only avoids hurting translation but improves it in 11 languages, and leads to large gains on tasks like NER across 27 languages.

Core claim

By using XML tags to indicate label spans during training, LabelPigeon performs translation and label projection in one model. This joint approach outperforms separate projection methods and even enhances translation quality in multiple languages due to the additional fine-tuning on tagged data. Evaluations across many languages and tasks confirm better cross-lingual transfer, with notable improvements in named entity recognition.

What carries the argument

LabelPigeon framework that inserts XML tags around annotated spans to jointly train a model for translation and label projection.
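
The tagging step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `insert_xml_tags`, the single-letter tag names, and the character-offset span format are assumptions made for this example, since the summary does not specify the exact tag vocabulary. It wraps each annotated span in an opening and closing tag and orders the markup so that nested spans stay well-formed.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, tag_name), character offsets, end exclusive.
Span = Tuple[int, int, str]


def insert_xml_tags(text: str, spans: List[Span]) -> str:
    """Wrap each labeled span in <tag>...</tag>; spans may nest but must not cross."""
    events = []
    for idx, (start, end, tag) in enumerate(spans):
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"invalid span: {(start, end, tag)}")
        # At the same offset, closing tags sort before opening tags.
        # Among closers, the inner span (later start) closes first.
        events.append((end, 0, -start, -idx, f"</{tag}>"))
        # Among openers, the outer span (later end) opens first.
        events.append((start, 1, -end, idx, f"<{tag}>"))
    events.sort()

    pieces, cursor = [], 0
    for pos, _kind, _tie, _idx, markup in events:
        pieces.append(text[cursor:pos])  # plain text up to this tag
        pieces.append(markup)
        cursor = pos
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    sentence = "Angela Merkel visited Paris."
    print(insert_xml_tags(sentence, [(0, 13, "a"), (22, 27, "b")]))
```

Crossing (non-nested) spans and markup characters already present in the text are not handled in this sketch.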

If this is right

Joint XML projection avoids the translation quality degradation reported in prior joint methods.
Additional fine-tuning on XML-tagged data consistently improves translation across 203 languages.
Cross-lingual transfer sees substantial gains across three downstream tasks in 27 languages, up to +40.2 F1 on NER.
Direct evaluation schemes can better isolate the benefits of joint modeling.
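
As a rough illustration of what a direct evaluation of projected labels might compute, the sketch below scores projected spans against gold spans with exact-match F1. The paper's actual scheme is not detailed in this summary, so the matching rule (identical label and surface string) and the name `span_f1` are assumptions.

```python
from typing import Iterable, Tuple

# Hypothetical representation: (label, surface text of the span in the target language).
LabeledSpan = Tuple[str, str]


def span_f1(predicted: Iterable[LabeledSpan], gold: Iterable[LabeledSpan]) -> float:
    """Exact-match F1 between projected spans and gold spans for one sentence."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        return 1.0  # nothing to project and nothing projected
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    projected = [("PER", "Angela Merkel"), ("LOC", "Berlin")]
    reference = [("PER", "Angela Merkel"), ("LOC", "Paris")]
    print(span_f1(projected, reference))
```

Scoring the spans directly, rather than only through a downstream model, keeps projection errors separate from errors introduced by the task model.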

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Structured markup like XML could guide models in other constrained generation tasks beyond label projection.
The approach might generalize to projecting other types of annotations without custom pipelines.
Reducing separate steps could simplify NLP workflows for low-resource language applications.

Load-bearing premise

The direct evaluation scheme isolates the joint XML method's benefit without bias from the specific tagging format or the extra fine-tuning.

What would settle it

Running the same fine-tuning but using a non-XML format for labels and observing if performance gains remain would test if XML specifically enables the joint benefit.
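
One way to set up such a control is to render the same tagged data in a different marker syntax and repeat the fine-tuning. The sketch below rewrites XML-style tags as bracket markers; the `[x|` ... `|x]` format and the function name `xml_to_brackets` are hypothetical choices for illustration, not a format taken from the paper.

```python
import re

# Matches simple, attribute-free tags such as <a> or </a> (an assumed tag syntax).
_TAG = re.compile(r"<(/?)([A-Za-z][\w-]*)>")


def xml_to_brackets(tagged: str) -> str:
    """Rewrite <x>...</x> as [x|...|x], leaving the surrounding text untouched."""

    def replace(match: "re.Match[str]") -> str:
        closing, name = match.group(1), match.group(2)
        return f"|{name}]" if closing else f"[{name}|"

    return _TAG.sub(replace, tagged)


if __name__ == "__main__":
    print(xml_to_brackets("<a>Angela Merkel</a> visited <b>Paris</b>."))
```

Because only the marker syntax changes, any difference in results between the two renderings would point at the format rather than at the data or the training budget.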

Figures

Figures reproduced from arXiv: 2603.12021 by Chris Biemann, Hans Ole Hatzel, Thennal DK.

**Figure 1.** An example taken from XQuAD (Artetxe et al., 2020), where LabelPigeon accurately and seamlessly handles translating English to German while transferring 7 labeled spans with nesting.

**Figure 2.** Examples of labeled English sentences with …

**Figure 3.** An example showcasing the tag swap that we …

**Figure 4.** Translation performance of our model on FLORES-200 as measured by chrF++ across different values of Pclose and Popen under the Complex marker insertion scheme.

Original abstract

Label projection is an effective technique for cross-lingual transfer, extending span-annotated datasets from a high-resource language to low-resource ones. Most approaches perform label projection as a separate step after machine translation, and prior work that combines the two reports degraded translation quality. We re-evaluate this claim with LabelPigeon, a novel framework that jointly performs translation and label projection via XML tags. We design a direct evaluation scheme for label projection, and find that LabelPigeon outperforms baselines and actively improves translation quality in 11 languages. We further assess translation quality across 203 languages and varying annotation complexity, finding consistent improvement attributed to additional fine-tuning. Finally, across 27 languages and three downstream tasks, we report substantial gains in cross-lingual transfer over comparable work, up to +40.2 F1 on NER. Overall, our results demonstrate that XML-tagged label projection provides effective and efficient label transfer without compromising translation quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LabelPigeon shows XML joint projection can lift both translation quality and downstream transfer at scale, but the gains may partly trace to fine-tuning rather than the joint format alone.

The main thing to know is that this paper makes a practical case for doing translation and label projection together by wrapping labels in XML tags inside the MT model. They call it LabelPigeon and report that it beats separate-step baselines while also improving translation quality in 11 languages, with consistent effects across 203 languages and solid downstream gains of up to +40.2 F1 on NER across 27 languages and three tasks.

Referee Report

2 major / 2 minor

Summary. The paper introduces LabelPigeon, a framework that jointly performs machine translation and label projection for cross-lingual transfer by embedding span annotations in XML tags. It re-evaluates prior claims that joint approaches degrade translation quality, introduces a direct evaluation scheme for label projection, and reports that the method outperforms baselines while improving translation quality in 11 languages. Additional results show consistent gains from fine-tuning across 203 languages and substantial improvements in downstream cross-lingual transfer (up to +40.2 F1 on NER) across 27 languages and three tasks.

Significance. If the central results hold, the work is significant for cross-lingual NLP because it provides an efficient, single-step alternative to separate translation-plus-projection pipelines and challenges the assumption that joint modeling necessarily harms MT quality. The scale of the evaluation (203 languages for translation, 27 for downstream tasks) and the focus on practical label transfer make the findings potentially impactful for low-resource settings if the evaluation controls are tightened.

major comments (2)

[Experiments section] Direct evaluation scheme (Experiments section): the scheme does not include matched controls that apply identical fine-tuning to non-XML baselines or to data projected after separate translation. Without these, the reported gains cannot be unambiguously attributed to the joint XML construction rather than to the tagging format or the extra fine-tuning step itself, which is load-bearing for the claim that LabelPigeon 'actively improves translation quality' and outperforms prior joint methods.
[Results on Translation Quality] Translation quality results (across 11 languages and 203-language scale): the manuscript attributes consistent improvements to 'additional fine-tuning' but provides no ablation that isolates the XML joint objective from the fine-tuning procedure or from possible data-selection effects, weakening the causal link between the proposed joint XML approach and the observed MT gains.

minor comments (2)

[Abstract] Abstract and results tables: error bars or standard deviations are not reported for the F1 gains or translation metrics, which would help assess the stability of the +40.2 F1 claim and the 'consistent' improvements across scales.
[Experiments] Baseline descriptions: more explicit details on how the comparable prior joint methods were re-implemented (hyperparameters, exact XML formatting) would improve reproducibility.
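
Reporting spread can be as simple as the sketch below, which summarizes scores from repeated runs as mean and sample standard deviation. The scores are placeholder values for illustration only, not results from the paper, and `summarize_runs` is a hypothetical helper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over scores from repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)


if __name__ == "__main__":
    # Placeholder F1 scores from three hypothetical seeds, not figures from the paper.
    f1_by_seed = [70.0, 72.0, 74.0]
    avg, spread = summarize_runs(f1_by_seed)
    print(f"F1 = {avg:.1f} ± {spread:.1f} over {len(f1_by_seed)} runs")
```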

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. The comments highlight important aspects of our experimental design that we will address in the revision to strengthen the attribution of our results to the proposed joint XML approach.

Referee: [Experiments section] Direct evaluation scheme (Experiments section): the scheme does not include matched controls that apply identical fine-tuning to non-XML baselines or to data projected after separate translation. Without these, the reported gains cannot be unambiguously attributed to the joint XML construction rather than to the tagging format or the extra fine-tuning step itself, which is load-bearing for the claim that LabelPigeon 'actively improves translation quality' and outperforms prior joint methods.

Authors: We agree that matched controls with identical fine-tuning applied to non-XML baselines would provide stronger evidence for attributing gains specifically to the joint XML construction. In the revised manuscript, we will add experiments that apply the same fine-tuning procedure to standard MT models without XML tags and to separately translated data with post-hoc projection, enabling direct isolation of the joint objective's contribution.

revision: yes

Referee: [Results on Translation Quality] Translation quality results (across 11 languages and 203-language scale): the manuscript attributes consistent improvements to 'additional fine-tuning' but provides no ablation that isolates the XML joint objective from the fine-tuning procedure or from possible data-selection effects, weakening the causal link between the proposed joint XML approach and the observed MT gains.

Authors: We acknowledge that an explicit ablation isolating the XML joint objective from fine-tuning and data-selection effects would strengthen the causal claims. We will include such an ablation in the revision, comparing the full LabelPigeon model to variants trained under identical fine-tuning regimes but without the joint XML projection objective, while controlling for data selection.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation

The paper proposes LabelPigeon as an XML-based joint translation and label projection framework and evaluates it empirically against external baselines on translation quality and downstream tasks (NER, etc.) across 27+ languages. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citation chains appear in the provided text. Claims of outperformance (+40.2 F1, etc.) are direct comparisons to prior work and are externally falsifiable, satisfying the self-contained benchmark criterion for a score of 0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a translation model can be trained to preserve XML tags without quality loss, plus standard training hyperparameters that are fitted to data.

free parameters (1)

fine-tuning hyperparameters
Improvements are attributed to additional fine-tuning, implying parameters chosen or fitted during training.

axioms (1)

domain assumption: A machine translation model can learn to output XML tags in correct positions while producing fluent translations.
This is the core premise enabling the joint framework to avoid degrading translation quality.
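
Whether this premise holds for a given output can be checked mechanically. The sketch below, under the assumption of simple attribute-free tags such as `<a>`...`</a>`, strips the tags from a model output, recovers the labeled character spans, and raises an error when tags are unbalanced or cross. The name `extract_spans` and the tag syntax are assumptions for this example.

```python
import re
from typing import List, Tuple

# Matches simple, attribute-free tags such as <a> or </a> (an assumed tag syntax).
_TAG = re.compile(r"<(/?)([A-Za-z][\w-]*)>")


def extract_spans(tagged: str) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Return (plain_text, spans), where spans are (start, end, tag) offsets in plain_text.

    Raises ValueError if a closing tag does not match the most recent open tag
    or if a tag is left unclosed.
    """
    plain_parts: List[str] = []
    spans: List[Tuple[int, int, str]] = []
    open_tags: List[Tuple[str, int]] = []  # stack of (tag name, start offset)
    cursor = 0      # read position in the tagged string
    plain_len = 0   # length of the plain text emitted so far

    for match in _TAG.finditer(tagged):
        chunk = tagged[cursor:match.start()]
        plain_parts.append(chunk)
        plain_len += len(chunk)
        cursor = match.end()
        closing, name = match.group(1), match.group(2)
        if not closing:
            open_tags.append((name, plain_len))
        else:
            if not open_tags or open_tags[-1][0] != name:
                raise ValueError(f"unexpected closing tag </{name}>")
            _, start = open_tags.pop()
            spans.append((start, plain_len, name))

    if open_tags:
        raise ValueError(f"unclosed tag <{open_tags[-1][0]}>")
    plain_parts.append(tagged[cursor:])
    return "".join(plain_parts), sorted(spans)


if __name__ == "__main__":
    text, found = extract_spans("<a>Angela Merkel</a> besuchte <b>Paris</b>.")
    print(text, found)
```

Counting how often such a check fails on model outputs gives a direct measure of how well the premise holds in practice.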

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1] Generating High Quality Proposition Banks for Multilingual Semantic Role Labeling. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 397–407, Beijing, China. Association for Computational Linguistics. Mary...
[2] On the Cross-lingual Transferability of Monolingual Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online. Association for Computational Linguistics. Mona Baker. 1993. Corpus Linguistics and Translation Studies — Implications and Applications. In Text and Technology, pages 233–...
[3] Transferring structural markup across translations using multilingual alignment and projection. In Proceedings of the 10th Annual Joint Conference on Digital Libraries, JCDL ’10, pages 11–20, New York, NY, USA. Association for Computing Machinery. Semere Kiros Bitew, Johannes Deleu, Chris Develder, and Thomas Demeester. 2021. Lazy Low-Resource Corefere...
[4] T-Projection: High Quality Annotation Projection for Sequence Labeling Tasks. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15203–15217, Singapore. Association for Computational Linguistics. Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Fra...
[5] Statistical Power and Translationese in Machine Translation Evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 72–81, Online. Association for Computational Linguistics. Greg Hanneman and Georgiana Dinu. 2020. How Should Markup Tags Be Translated? In Proceedings of the Fifth Conference o...
[6] XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation. In Proceedings of the 37th International Conference on Machine Learning, pages 4411–4421. PMLR. Eric Joanis, Darlene Stewart, Samuel Larkin, and Roland Kuhn. 2013. Transferring markup tags in statistical machine translation: A two-stream approach. In Proc...
[7] Constrained decoding for cross-lingual label projection. In The Twelfth International Conference on Learning Representations, Vienna, Austria. Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2020. MLQA: Evaluating Cross-lingual Extractive Question Answering. In Proceedings of the 58th Annual Meeting of the Association for ...
[8] Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4322–4337, Mexico City, Mexico. Association for Computational Linguistics. Mehrad Moradshahi, Gi...
[9] CorefUD 1.0: Coreference Meets Universal Dependencies. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4859–4872, Marseille, France. European Language Resources Association. Jian Ni, Georgiana Dinu, and Radu Florian. 2017. Weakly Supervised Cross-Lingual Named Entity Recognition via Effective Annotation and Represen- ...
[10] A Controlled Reevaluation of Coreference Resolution Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 256–263, Torino, Italia. ELRA and ICCL. Ondřej Pražák, Miloslav Konopík, and Jakub Sido. 2021. Multilingual Coreference Resolution with Harmo- niz...
[11] Gemma 3 Technical Report

COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics. Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovi...
[12] Are LLMs Good Annotators for Discourse-level Event Relation Extraction? In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1–19, Miami, Florida, USA. Association for Computational Linguistics. Weischedel, Ralph, Palmer, Martha, Marcus, Mitchell, Hovy, Eduard, Pradhan, Sameer, Ramshaw, Lance, Xue, Nianwen, Taylor, Ann, Kaufman,...
[13] The model is trained with the hyperparameters given in Table 4. We note that since the original dataset includes examples with multiple instances of the same tag, the total number of tags is higher than the unique number of tags. As we filter out instances without tags, the minimum number of tags is 1 for all training data subsets. A.1 Ablations The full ...
[14] <80 . The resulting dataset statistics are compiled in Table 8. We note that for the downstream evaluation in §7, we use the full MLQA dataset as the filtering is not necessary for question-answering evaluation. B.2 Label Projection into English As label projection is largely applied for translating labeled data for low-resource languages, we focus on ...
