Just Use XML: Revisiting Joint Translation and Label Projection
Pith reviewed 2026-05-15 11:41 UTC · model grok-4.3
The pith
LabelPigeon jointly translates text and projects labels using XML tags, improving both translation quality and downstream cross-lingual performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By using XML tags to indicate label spans during training, LabelPigeon performs translation and label projection in one model. This joint approach outperforms separate projection methods and even enhances translation quality in multiple languages due to the additional fine-tuning on tagged data. Evaluations across many languages and tasks confirm better cross-lingual transfer, with notable improvements in named entity recognition.
What carries the argument
The LabelPigeon framework, which inserts XML tags around annotated spans so that a single model is trained to translate text and project labels jointly.
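To make the mechanism concrete, here is a minimal sketch of wrapping annotated spans in XML tags and reading them back from a tagged string. The function names, the label names (`PER`, `LOC`), and the flat, non-overlapping tag layout are assumptions for illustration; the paper's exact tag format is not given in this review.

```python
import re

def tag_spans(text, spans):
    """Wrap each (start, end, label) character span in <label>...</label>.

    Assumes the spans do not overlap and that offsets index into `text`.
    """
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"<{label}>{text[start:end]}</{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

# Matches a flat (non-nested) tagged span such as <LOC>Paris</LOC>.
TAG_RE = re.compile(r"<(\w+)>(.*?)</\1>")

def extract_spans(tagged):
    """Return (plain_text, spans) recovered from a flat tagged string."""
    plain, spans, cursor, offset = [], [], 0, 0
    for m in TAG_RE.finditer(tagged):
        before = tagged[cursor:m.start()]
        plain.append(before)
        offset += len(before)
        inner = m.group(2)
        spans.append((offset, offset + len(inner), m.group(1)))
        plain.append(inner)
        offset += len(inner)
        cursor = m.end()
    plain.append(tagged[cursor:])
    return "".join(plain), spans

source = "Angela Merkel visited Paris."
tagged_source = tag_spans(source, [(0, 13, "PER"), (22, 27, "LOC")])
# A translation that preserved the tags (hand-written here, not model output).
tagged_target = "<PER>Angela Merkel</PER> besuchte <LOC>Paris</LOC>."
target_text, target_spans = extract_spans(tagged_target)
```

In this setup the tagged source would be the model input, and the spans recovered from the tagged output are the projected labels in the target language.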
If this is right
- Joint XML projection avoids the translation quality degradation reported in prior joint methods.
- Additional fine-tuning on XML-tagged data consistently improves translation across 203 languages.
- Cross-lingual transfer sees substantial gains on NER, POS tagging, and other tasks.
- Direct evaluation schemes can better isolate the benefits of joint modeling.
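This review does not describe how the direct evaluation scheme scores projected labels. One common choice, assumed here purely for illustration, is exact-match span F1 between gold target-side spans and the spans a system projects. The function name and the example spans are hypothetical.

```python
def span_f1(gold, pred):
    """Exact-match F1 over (start, end, label) spans.

    A predicted span counts as correct only if its boundaries and its
    label both match a gold span.
    """
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 13, "PER"), (23, 28, "LOC")]
pred = [(0, 13, "PER"), (23, 28, "ORG")]  # second span has the wrong label
score = span_f1(gold, pred)
```

Scoring projection directly in this way separates projection errors from errors made by a downstream model trained on the projected data.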
Where Pith is reading between the lines
- Structured markup like XML could guide models in other constrained generation tasks beyond label projection.
- The approach might generalize to projecting other types of annotations without custom pipelines.
- Reducing separate steps could simplify NLP workflows for low-resource language applications.
Load-bearing premise
The direct evaluation scheme isolates the joint XML method's benefit without bias from the specific tagging format or the extra fine-tuning.
What would settle it
Repeating the same fine-tuning with a non-XML label format and checking whether the gains persist would test whether XML specifically enables the joint benefit.
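A sketch of how such a control could be prepared: render the same annotated spans in two markup styles so that the training data differs only in label format. The `bracket` style and all names below are hypothetical; only XML tagging is described in the paper's abstract.

```python
# Hypothetical markup styles: (opening template, closing template).
STYLES = {
    "xml": ("<{label}>", "</{label}>"),
    "bracket": ("[[{label}: ", "]]"),
}

def render(text, spans, style):
    """Mark up non-overlapping (start, end, label) spans in the given style."""
    open_t, close_t = STYLES[style]
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(open_t.format(label=label))
        out.append(text[start:end])
        out.append(close_t.format(label=label))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

sentence = "Paris is big."
spans = [(0, 5, "LOC")]
xml_version = render(sentence, spans, "xml")
bracket_version = render(sentence, spans, "bracket")
```

Fine-tuning one model on each rendering, with everything else held fixed, would show whether any gain depends on the XML format itself.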
Original abstract
Label projection is an effective technique for cross-lingual transfer, extending span-annotated datasets from a high-resource language to low-resource ones. Most approaches perform label projection as a separate step after machine translation, and prior work that combines the two reports degraded translation quality. We re-evaluate this claim with LabelPigeon, a novel framework that jointly performs translation and label projection via XML tags. We design a direct evaluation scheme for label projection, and find that LabelPigeon outperforms baselines and actively improves translation quality in 11 languages. We further assess translation quality across 203 languages and varying annotation complexity, finding consistent improvement attributed to additional fine-tuning. Finally, across 27 languages and three downstream tasks, we report substantial gains in cross-lingual transfer over comparable work, up to +40.2 F1 on NER. Overall, our results demonstrate that XML-tagged label projection provides effective and efficient label transfer without compromising translation quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LabelPigeon, a framework that jointly performs machine translation and label projection for cross-lingual transfer by embedding span annotations in XML tags. It re-evaluates prior claims that joint approaches degrade translation quality, introduces a direct evaluation scheme for label projection, and reports that the method outperforms baselines while improving translation quality in 11 languages. Additional results show consistent gains from fine-tuning across 203 languages and substantial improvements in downstream cross-lingual transfer (up to +40.2 F1 on NER) across 27 languages and three tasks.
Significance. If the central results hold, the work is significant for cross-lingual NLP because it provides an efficient, single-step alternative to separate translation-plus-projection pipelines and challenges the assumption that joint modeling necessarily harms MT quality. The scale of the evaluation (203 languages for translation, 27 for downstream tasks) and the focus on practical label transfer make the findings potentially impactful for low-resource settings if the evaluation controls are tightened.
major comments (2)
- [Experiments section] Direct evaluation scheme: the scheme does not include matched controls that apply identical fine-tuning to non-XML baselines or to data projected after separate translation. Without these, the reported gains cannot be unambiguously attributed to the joint XML construction rather than to the tagging format or the extra fine-tuning step itself, which is load-bearing for the claim that LabelPigeon 'actively improves translation quality' and outperforms prior joint methods.
- [Results on Translation Quality] Translation quality results (across 11 languages and 203-language scale): the manuscript attributes consistent improvements to 'additional fine-tuning' but provides no ablation that isolates the XML joint objective from the fine-tuning procedure or from possible data-selection effects, weakening the causal link between the proposed joint XML approach and the observed MT gains.
minor comments (2)
- [Abstract] Abstract and results tables: error bars or standard deviations are not reported for the F1 gains or translation metrics, which would help assess the stability of the +40.2 F1 claim and the 'consistent' improvements across scales.
- [Experiments] Baseline descriptions: more explicit details on how the comparable prior joint methods were re-implemented (hyperparameters, exact XML formatting) would improve reproducibility.
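The error bars requested in the first minor comment could be reported as a mean and sample standard deviation over repeated runs. The sketch below uses placeholder scores, not results from the paper, and the function name is an assumption.

```python
from statistics import mean, stdev

def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)

# Placeholder F1 values for three random seeds; NOT results from the paper.
f1_by_seed = [70.0, 72.0, 74.0]
f1_mean, f1_std = summarize(f1_by_seed)
print(f"F1: {f1_mean:.1f} ± {f1_std:.1f}")
```

Reporting the spread alongside each headline number lets a reader judge whether a gain exceeds run-to-run variation.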
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. The comments highlight important aspects of our experimental design that we will address in the revision to strengthen the attribution of our results to the proposed joint XML approach.
- Referee: [Experiments section] Direct evaluation scheme: the scheme does not include matched controls that apply identical fine-tuning to non-XML baselines or to data projected after separate translation. Without these, the reported gains cannot be unambiguously attributed to the joint XML construction rather than to the tagging format or the extra fine-tuning step itself, which is load-bearing for the claim that LabelPigeon 'actively improves translation quality' and outperforms prior joint methods.
  Authors: We agree that matched controls with identical fine-tuning applied to non-XML baselines would provide stronger evidence for attributing gains specifically to the joint XML construction. In the revised manuscript, we will add experiments that apply the same fine-tuning procedure to standard MT models without XML tags and to separately translated data with post-hoc projection, enabling direct isolation of the joint objective's contribution.
  Revision: yes
- Referee: [Results on Translation Quality] Translation quality results (across 11 languages and 203-language scale): the manuscript attributes consistent improvements to 'additional fine-tuning' but provides no ablation that isolates the XML joint objective from the fine-tuning procedure or from possible data-selection effects, weakening the causal link between the proposed joint XML approach and the observed MT gains.
  Authors: We acknowledge that an explicit ablation isolating the XML joint objective from fine-tuning and data-selection effects would strengthen the causal claims. We will include such an ablation in the revision, comparing the full LabelPigeon model to variants trained under identical fine-tuning regimes but without the joint XML projection objective, while controlling for data selection.
  Revision: yes
Circularity Check
No significant circularity in empirical evaluation
The paper proposes LabelPigeon as an XML-based joint translation and label projection framework and evaluates it empirically against external baselines on translation quality and downstream tasks (NER, etc.) across 27+ languages. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citation chains appear in the provided text. Claims of outperformance (+40.2 F1, etc.) are direct comparisons to prior work and are externally falsifiable, satisfying the self-contained benchmark criterion for a score of 0.
Axiom & Free-Parameter Ledger
free parameters (1)
- fine-tuning hyperparameters
axioms (1)
- domain assumption: a machine translation model can learn to output XML tags in the correct positions while producing fluent translations
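Part of this assumption can be checked mechanically on model output. The sketch below tests that tags in an output string are balanced and that the output carries the same labels as the tagged source. The helper names are hypothetical, and the check says nothing about whether the tags sit on the right words or whether the translation is fluent.

```python
import re
from collections import Counter

# Matches simple opening or closing tags such as <PER> or </PER>.
TAG = re.compile(r"<(/?)(\w+)>")

def tags_well_formed(output):
    """True if every opening tag is closed in properly nested order."""
    stack = []
    for m in TAG.finditer(output):
        closing, name = m.group(1), m.group(2)
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack

def opening_labels(text):
    """Count how often each label appears as an opening tag."""
    return Counter(m.group(2) for m in TAG.finditer(text) if not m.group(1))

def labels_preserved(source, output):
    """True if the output has the same multiset of labels as the source."""
    return opening_labels(source) == opening_labels(output)
```

Outputs that fail either check could be filtered out or counted as projection failures.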