pith. machine review for the scientific record.

arxiv: 2512.07538 · v3 · submitted 2025-12-08 · 💻 cs.CL

Recognition: 1 theorem link · Lean Theorem

SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 00:47 UTC · model grok-4.3

classification 💻 cs.CL
keywords: semantic difference recognition · cross-lingual benchmark · token-level annotation · document-level evaluation · large language models · encoder models · multilingual semantic alignment

The pith

A new benchmark shows that LLMs and encoder models perform poorly on token-level semantic difference recognition in cross-lingual documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SwissGov-RSD, a collection of 224 multi-parallel Swiss government documents in English paired with German, French, or Italian, each carrying human annotations that mark semantic differences at the token level. It then tests a range of open-source and closed-source large language models plus encoder models under varied fine-tuning conditions. The evaluations demonstrate substantially weaker results than those models achieve on monolingual, sentence-level, or synthetic versions of similar tasks. If the reported gap holds, it means current automatic methods are not yet equipped to handle the kinds of naturalistic cross-lingual variation that arise in real document collections, limiting their usefulness for text generation evaluation and content alignment.
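To make the task concrete: in token-level semantic difference recognition, every token in a document pair carries a binary label (difference vs. match), and a system is scored by comparing its predicted labels against the human gold labels. Below is a minimal sketch of a token-level F1 computation in Python, assuming the common convention that 1 marks a semantic difference; the label encoding and the toy example are illustrative assumptions, not the paper's released format.

    def token_f1(gold, pred):
        """Token-level F1 over binary difference labels (1 = semantic difference)."""
        assert len(gold) == len(pred)
        tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
        fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
        fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    # Toy example: the final clause of an English sentence has no German counterpart.
    gold = [0, 0, 0, 1, 1, 1]  # human annotation
    pred = [0, 0, 1, 1, 1, 0]  # model prediction
    print(f"token-level F1 = {token_f1(gold, pred):.2f}")  # 0.67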

Core claim

The authors introduce SwissGov-RSD as the first naturalistic, document-level, cross-lingual dataset for semantic difference recognition, consisting of 224 multi-parallel documents across three language pairs with token-level human annotations, and show that both LLMs and encoder models achieve considerably lower performance on this benchmark than on monolingual, sentence-level, and synthetic alternatives.

What carries the argument

SwissGov-RSD, the human-annotated dataset of multi-parallel government documents that supplies token-level labels for semantic differences across languages.

If this is right

  • Text generation evaluation metrics will need to incorporate document-level cross-lingual checks to remain reliable.
  • Content alignment systems for multilingual corpora must be validated on naturalistic data rather than synthetic or monolingual proxies.
  • Model training objectives should target token-level semantic distinctions that appear only when documents are compared across languages.
  • Benchmarking practices for LLMs should include cross-lingual document pairs to avoid overestimating readiness for practical use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same annotation approach could be applied to other language pairs or domains to test whether the performance gap is specific to government text or more general.
  • Closing the gap on this benchmark might directly improve the quality of multilingual summarization and translation quality estimation.
  • Future work could explore whether the dataset's structure supports new pre-training signals that emphasize cross-lingual semantic invariance at the token level.

Load-bearing premise

Human annotators can reliably and consistently identify all meaningful token-level semantic differences, and the chosen Swiss government documents represent typical cross-lingual variation in real documents.

What would settle it

A model or training procedure that reaches performance levels comparable to its monolingual or sentence-level results when evaluated on the SwissGov-RSD test set would falsify the claimed performance gap.

Figures

Figures reproduced from arXiv: 2512.07538 by Jannis Vamvas, Michelle Wastl, Rico Sennrich.

Figure 1: Excerpt from an English-German document pair from the SwissGov-RSD dataset, annotated with token-level differences. The differences that we found range from explicitations to omitted paragraphs. The paragraph marked in deep red contains information about emergency calls and is completely omitted in the English document. Motivated by these points, we collect documents with naturally occurring semantic diffe…

Figure 2: Architectures used in our experiments. An…

Figure 3: Label distribution of the final SwissGov-RSD dataset in tokens (separated by white spaces). 0-labeled…

Figure 4: Excerpt of an EN-DE document pair with gold labels and predictions one model from each of the system…

Figure 5: Average Spearman correlation coefficient at…

Figure 6: Label distribution for all languages by each annotator in the trial phase.

Figure 7: Label distribution for all languages by each annotator in the main phase.

Figure 8: Different annotation strategies: same difference, but one annotator labels the whole phrase while the other…

Figure 9: Different annotation strategies: One annotator does not mark omissions or additions, while the other does.

Figure 10: Example of an automatically created augmentation based on three individual, randomly selected sentence…
Original abstract

Recognizing semantic differences across documents is crucial for text generation evaluation and content alignment, especially in cross-lingual settings. However, as a standalone task, it has received little attention. We address this by introducing SwissGov-RSD, the first naturalistic, document-level, cross-lingual dataset for semantic difference recognition. It encompasses a total of 224 multi-parallel documents in English--German, English--French, and English--Italian with token-level difference annotations by human annotators. We evaluate a variety of open-source and closed-source large language models as well as encoder models across different fine-tuning settings on this new benchmark. Our results show that current automatic approaches perform poorly compared to their performance on monolingual, sentence-level, and synthetic benchmarks, revealing a considerable gap for both LLMs and encoder models. We make our code and dataset publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SwissGov-RSD, the first naturalistic document-level cross-lingual benchmark for token-level semantic difference recognition. It consists of 224 multi-parallel Swiss government documents in English--German, English--French, and English--Italian pairs, with human token-level difference annotations. The authors evaluate open- and closed-source LLMs plus encoder models across fine-tuning settings and report that current automatic approaches perform substantially worse than on monolingual, sentence-level, or synthetic benchmarks, revealing a considerable gap.

Significance. If the annotations are reliable, the work provides a valuable new resource for an underexplored task relevant to text generation evaluation and cross-lingual content alignment. The public release of the dataset and code is a clear strength that supports reproducibility and future model development. The empirical finding of a performance gap on naturalistic data could usefully guide research priorities, provided the gold-standard quality is established.

major comments (2)
  1. §3 (Dataset Construction and Annotation): No inter-annotator agreement statistics, annotation guidelines, or disagreement-resolution procedure are reported. This is load-bearing for the central claim: the reported performance gap for LLMs and encoders is interpreted as evidence that models struggle with naturalistic cross-lingual semantic differences, but without IAA or a validation subset, low scores could partly reflect annotation noise rather than model shortcomings.
  2. §4 (Experiments and Results): The cross-benchmark comparison (monolingual/synthetic vs. SwissGov-RSD) does not control for differences in document length, domain specificity, or annotation granularity. This weakens the attribution of the gap specifically to the cross-lingual naturalistic setting.
minor comments (2)
  1. Table 1 or equivalent: Clarify the exact distribution of documents across the three language pairs and the total number of annotated tokens so readers can assess the dataset's scale immediately.
  2. §2 (Related Work): A brief discussion of how token-level semantic difference annotation differs from standard semantic textual similarity or entailment tasks would help readers appreciate the novelty.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on SwissGov-RSD. The comments highlight important aspects of annotation reliability and comparative analysis that we address below. We provide point-by-point responses to the major comments.

Point-by-point responses
  1. Referee: §3 (Dataset Construction and Annotation): No inter-annotator agreement statistics, annotation guidelines, or disagreement-resolution procedure are reported. This is load-bearing for the central claim: the reported performance gap for LLMs and encoders is interpreted as evidence that models struggle with naturalistic cross-lingual semantic differences, but without IAA or a validation subset, low scores could partly reflect annotation noise rather than model shortcomings.

    Authors: We agree that inter-annotator agreement (IAA) statistics, annotation guidelines, and the disagreement-resolution procedure are essential to establish annotation quality and to support the interpretation of the performance gap. The current manuscript does not report these details. In the revised version, we will include the full annotation guidelines as supplementary material, report IAA using token-level agreement metrics (e.g., Krippendorff's alpha or pairwise F1; see the sketch after these responses), and describe the resolution process (e.g., adjudication by a third annotator). We will also note any validation subset used during annotation. revision: yes

  2. Referee: §4 (Experiments and Results): The cross-benchmark comparison (monolingual/synthetic vs. SwissGov-RSD) does not control for differences in document length, domain specificity, or annotation granularity. This weakens the attribution of the gap specifically to the cross-lingual naturalistic setting.

    Authors: We acknowledge that the benchmarks differ in document length, domain, and annotation granularity, and that these factors are not explicitly controlled in the current comparisons. These differences are inherent to contrasting controlled synthetic or sentence-level settings with naturalistic document-level cross-lingual data. To strengthen the analysis, we will add a dedicated discussion section in the revision that explicitly addresses these confounders, include length-stratified performance results where feasible (see the second sketch below), and clarify that the observed gap reflects the combined challenges of the naturalistic cross-lingual document setting rather than isolating a single variable. We maintain that the direct comparison remains informative for highlighting real-world difficulties. revision: partial
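As a concrete illustration of the agreement reporting promised above, here is a minimal sketch of Krippendorff's alpha for nominal data, assuming binary token labels, exactly two annotators, and no missing values; the full coefficient generalizes to more annotators and missing data, which this simplified version does not handle. (Pairwise token-level F1, the other metric mentioned, is the token_f1 function above applied to one annotator's labels with the other's as reference.)

    from collections import Counter

    def alpha_nominal_two_coders(a, b):
        """Krippendorff's alpha (nominal) for two annotators, no missing labels."""
        assert len(a) == len(b) and len(a) > 1
        n_values = 2 * len(a)                   # total pairable values
        mismatches = sum(x != y for x, y in zip(a, b))
        d_observed = 2 * mismatches / n_values  # off-diagonal coincidence mass
        counts = Counter(a) + Counter(b)        # marginal label frequencies
        d_expected = sum(counts[c] * counts[k]
                         for c in counts for k in counts
                         if c != k) / (n_values * (n_values - 1))
        return 1.0 if d_expected == 0 else 1 - d_observed / d_expected

    # Two annotators label the same six tokens and disagree on one of them.
    ann_a = [0, 0, 1, 1, 1, 0]
    ann_b = [0, 0, 1, 1, 0, 0]
    print(f"alpha = {alpha_nominal_two_coders(ann_a, ann_b):.2f}")  # ~0.69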
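Likewise, the length-stratified analysis promised in the second response could look like the following sketch, assuming per-document token counts and scores are available; the bucket edges here are arbitrary illustrative choices, not values from the paper.

    import statistics

    def mean_score_by_length(lengths, scores,
                             edges=(0, 250, 500, 1000, float("inf"))):
        """Bucket per-document scores by token count; report each bucket's mean."""
        assert len(lengths) == len(scores)
        buckets = {(lo, hi): [] for lo, hi in zip(edges, edges[1:])}
        for length, score in zip(lengths, scores):
            for lo, hi in zip(edges, edges[1:]):
                if lo <= length < hi:
                    buckets[(lo, hi)].append(score)
                    break
        return {f"{lo}-{hi} tokens": statistics.mean(vals)
                for (lo, hi), vals in buckets.items() if vals}

    # Toy data: three documents as (token count, token-level F1).
    print(mean_score_by_length([120, 480, 1500], [0.42, 0.31, 0.18]))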

Circularity Check

0 steps flagged

No circularity: empirical benchmark with external dataset and model evaluations

full rationale

The paper introduces SwissGov-RSD as a new human-annotated dataset of 224 multi-parallel documents with token-level semantic difference labels and reports empirical performance of LLMs and encoder models on it. No mathematical derivations, equations, or parameter-fitting steps are present that could reduce to self-definition or fitted inputs by construction. The central claim (poor model performance relative to monolingual/synthetic benchmarks) rests on direct comparison to the external annotations rather than any internal loop or self-citation chain. This is a standard benchmark paper whose results are falsifiable against the released dataset and do not rely on renaming prior results or smuggling ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the quality of human annotations and the representativeness of the Swiss government document collection; no free parameters are fitted, no new entities are postulated, and the axioms are standard domain assumptions in NLP benchmarking.

axioms (1)
  • domain assumption Human annotators can reliably identify token-level semantic differences in parallel documents
    The benchmark validity depends on the accuracy and consistency of the human token-level annotations described in the abstract.

pith-pipeline@v0.9.0 · 5454 in / 1206 out tokens · 52542 ms · 2026-05-17T00:47:10.168037+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
