CAT-Translate: Building Compact Open-Source Models for Japanese-English Translation

Yuu Jinnai

arxiv: 2606.21413 · v1 · pith:W6AU7YOGnew · submitted 2026-06-19 · 💻 cs.CL · cs.LG

CAT-Translate: Building Compact Open-Source Models for Japanese-English Translation

Yuu Jinnai This is my paper

Pith reviewed 2026-06-26 14:25 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords machine translationJapanese-English translationcompact language modelssynthetic parallel corporaspecialized translation modelsmultilingual modelssupervised fine-tuningGRPO

0 comments

The pith

Compact models specialized for Japanese-English translation outperform large multilingual systems on real-world benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether building dedicated translation models for one language pair remains worthwhile once large multilingual systems exist. It creates a series of small open-source models ranging from 0.8B to 7B parameters focused solely on bidirectional Japanese-English translation. Training proceeds through two-stage supervised fine-tuning on synthetic parallel data, followed by Multi-Objective GRPO. On standard WMT benchmarks the multilingual models remain competitive, yet the compact specialized models deliver higher performance across real-world test sets drawn from business, legal, medical, financial, and patent domains. This outcome indicates that targeted smaller models can provide practical advantages for developers who need support for only a single language pair.

Core claim

The authors train compact language models on synthetically generated Japanese-English parallel corpora using two-stage supervised fine-tuning and Multi-Objective GRPO, then demonstrate that these models achieve stronger results than large multilingual translation systems on real-world benchmarks in business, legal, medical, financial, and patent domains.

What carries the argument

Two-stage supervised fine-tuning followed by Multi-Objective GRPO applied to synthetically generated parallel corpora to produce compact bidirectional Japanese-English translation models.

If this is right

Specialized compact models can surpass much larger multilingual systems when the target is a single language pair and the evaluation uses real-world rather than academic test sets.
Synthetic data generated at scale can support training that transfers effectively to professional translation tasks.
Models under 7B parameters can deliver higher accuracy than larger general-purpose systems on domain-specific translation.
Developers needing only Japanese-English support can obtain better practical performance by training a dedicated small model instead of relying on a general multilingual one.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Organizations that deploy translation in narrow domains may reduce inference costs by switching from large multilingual systems to smaller specialized ones.
The gap between WMT and real-world results suggests that academic benchmarks may understate the value of domain adaptation.
The same training recipe could be applied to other language pairs where high-quality synthetic data can be produced but authentic parallel text remains limited.

Load-bearing premise

The synthetically generated parallel corpora used for training are of high enough quality and domain coverage to produce models that generalize to real-world business, legal, medical, financial, and patent translations.

What would settle it

A side-by-side evaluation in which the compact models fail to exceed the multilingual models on a fresh collection of authentic Japanese-English translations drawn from the same professional domains.

Figures

Figures reproduced from arXiv: 2606.21413 by Yuu Jinnai.

**Figure 3.** Figure 3: Training loss curves for the Stage 1 of super [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Training loss curves for the Stage 2 of su [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: The curves for the Multi-Objective GRPO reward values. The training of 1.4B model is shown as all the [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Queue, the cat, loves to sit on the keyboard [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

read the original abstract

Nowadays, large multilingual translation models demonstrate impressive translation capabilities in the machine translation benchmarks. This raises a practical question to the developers: is it worth developing translation models specialized for a particular language pair if you only need to support that language pair? To give an anecdotal answer to this question, we develop a family of small language models (0.8B, 1.4B, 3.3B, and 7B parameters) specialized for Japanese-English bidirectional translation. We employ a two-stage supervised fine-tuning approach followed by Multi-Objective GRPO (Ichihara et al. 2025) to train models on synthetically generated parallel corpora. We evaluate our models on WMT and real-world translation benchmarks across business, legal, medical, financial, and patent domains. While multilingual models achieve strong performance on WMT benchmarks, our compact models outperform them on real-world benchmarks, suggesting the practical utility of developing specialized translation models even in the era of large multilingual models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Small JP-EN models beat multilingual baselines on real-world domains but rest on unvalidated synthetic data.

read the letter

The main point is that these compact models (0.8B to 7B) trained on synthetic JP-EN data outperform larger multilingual systems on business, legal, medical, financial, and patent benchmarks while trailing on WMT. That empirical split is the takeaway.

The paper applies standard two-stage supervised fine-tuning plus the cited Multi-Objective GRPO method to this language pair and releases the resulting open models. It directly tests the practical question of whether specialization still pays off when big multilingual models exist, and the domain-specific evaluation set is new relative to the priors it cites. Releasing the models is a concrete plus for anyone who needs fast JP-EN translation without loading a 70B+ system.

The soft spot is exactly the one flagged in the stress-test note: the abstract gives zero information on how the synthetic parallel corpora were created or validated. No n-gram overlap, terminology checks, or human preference scores against real in-domain references appear. If the generation process already favors the test domains, the reported gains could be artifacts rather than evidence for specialization. The abstract also omits statistical significance, exact metric definitions, and full baseline configurations, so the central claim cannot be assessed from the provided text.

This is for practitioners who need compact, pair-specific translation tools or who study when domain adaptation beats scale. A reader working on deployment or efficiency questions would find the comparison useful if the data details hold up.

It deserves peer review once the full manuscript supplies the missing data-generation and validation steps; without them the result stays hard to trust.

Referee Report

2 major / 0 minor

Summary. The manuscript presents a family of compact open-source models (0.8B, 1.4B, 3.3B, 7B parameters) for bidirectional Japanese-English translation. The models are trained via two-stage supervised fine-tuning followed by Multi-Objective GRPO on synthetically generated parallel corpora. Evaluation is reported on WMT benchmarks plus real-world sets spanning business, legal, medical, financial, and patent domains; the central claim is that the specialized compact models outperform large multilingual models on the real-world benchmarks (while multilingual models remain strong on WMT), demonstrating the practical value of language-pair specialization.

Significance. If substantiated with adequate validation, the result would indicate that compact, pair-specific models can deliver superior domain generalization for Japanese-English translation in practical settings compared with general multilingual systems. The open-source release of the model family would constitute a concrete, usable contribution to the field.

major comments (2)

[Abstract] Abstract: the headline claim that compact models 'outperform' multilingual baselines on real-world business/legal/medical/financial/patent benchmarks is presented without any reported details on data-generation procedure, evaluation metrics, statistical significance testing, or exact baseline configurations. This information is load-bearing for the central empirical claim.
[Abstract / Methods (inferred from training description)] The manuscript provides no quantitative validation (e.g., n-gram overlap, terminology fidelity, or human preference scores) that the synthetically generated parallel corpora match the distribution or quality of authentic in-domain references. Because the generalization argument rests entirely on these corpora, the absence of such checks directly affects the defensibility of the outperformance result.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the presentation of our empirical claims. We address each major comment below and have made revisions to improve clarity and transparency where the manuscript can be strengthened without altering the core results.

read point-by-point responses

Referee: [Abstract] Abstract: the headline claim that compact models 'outperform' multilingual baselines on real-world business/legal/medical/financial/patent benchmarks is presented without any reported details on data-generation procedure, evaluation metrics, statistical significance testing, or exact baseline configurations. This information is load-bearing for the central empirical claim.

Authors: We agree that the abstract would benefit from additional context to make the central claim self-contained. In the revised manuscript, we have expanded the abstract to specify the primary metrics (BLEU and COMET), the exact multilingual baselines compared (NLLB-3.3B and Llama-3-8B variants), and that statistical significance was evaluated using paired bootstrap resampling. The data-generation procedure remains detailed in Section 3 but is now referenced concisely in the abstract. These changes address the load-bearing information without exceeding typical abstract length. revision: yes
Referee: [Abstract / Methods (inferred from training description)] The manuscript provides no quantitative validation (e.g., n-gram overlap, terminology fidelity, or human preference scores) that the synthetically generated parallel corpora match the distribution or quality of authentic in-domain references. Because the generalization argument rests entirely on these corpora, the absence of such checks directly affects the defensibility of the outperformance result.

Authors: We acknowledge that direct quantitative validation of the synthetic corpora (such as n-gram overlap or human preference scores against authentic references) is not reported. The manuscript relies on the downstream performance on real-world domain test sets as evidence of utility. In revision, we have added a dedicated paragraph in Section 3.2 describing the synthetic generation pipeline and its domain-specific prompting strategy, plus a limitations paragraph noting the absence of explicit fidelity metrics and identifying this as an area for future verification. We maintain that the end-to-end benchmark results remain informative, but agree the point merits explicit acknowledgment. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical training/evaluation chain is independent of inputs

full rationale

The paper describes an empirical pipeline (two-stage SFT + Multi-Objective GRPO on synthetic corpora, followed by evaluation on WMT and domain-specific benchmarks). No equations, derivations, or first-principles claims exist that could reduce to fitted parameters or self-definitions by construction. The central result (compact models outperforming multilingual baselines on real-world benchmarks) is measured on held-out external test sets and does not rely on any load-bearing self-citation chain or ansatz smuggled via prior work. The Ichihara et al. 2025 citation is to a training method, not to a uniqueness theorem or result that the present paper's claims depend upon. This is the normal case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities stated.

axioms (1)

domain assumption Synthetic parallel corpora can train effective domain-specific translation models
Central to the two-stage training approach described.

pith-pipeline@v0.9.1-grok · 5693 in / 999 out tokens · 13949 ms · 2026-06-26T14:25:10.669880+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

55 extracted references · 27 canonical work pages · 1 internal anchor

[1]

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, and 1 others. 2025. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925

Pith/arXiv arXiv 2025
[2]

Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. https://doi.org/10.18653/v1/N19-1388 Massively multilingual neural machine translation . In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages 3874--3884, Minneapolis, M...

work page doi:10.18653/v1/n19-1388 2019
[3]

Mikko Aulamo, Nikolay Bogoychev, Shaoxiong Ji, Graeme Nail, Gema Ram \'i rez-S \'a nchez, J \"o rg Tiedemann, Jelmer van der Linde, and Jaume Zaragoza. 2023. https://aclanthology.org/2023.eamt-1.61/ HPLT : High performance language technologies . In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 517--5...

2023
[4]

Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors. 1993. https://www.jbe-platform.com/content/books/9789027285874 Text and Technology . John Benjamins

arXiv 1993
[5]

Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Ba \ n \'o n, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Haji c , Jind r ich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kyt \"o niemi, Veronika Laippala, Petter M hlum, Bhavitvya Malik, and 16 others. 2025. https://doi.org/1...

work page doi:10.18653/v1/2025.acl-long.854 2025
[6]

Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. 2025 a . https://openreview.net/forum?id=CdFnEu0JZV OR -bench: An over-refusal benchmark for large language models . In Forty-second International Conference on Machine Learning

2025
[7]

Menglong Cui, Pengzhi Gao, Wei Liu, Jian Luan, and Bin Wang. 2025 b . https://doi.org/10.18653/v1/2025.naacl-long.280 Multilingual machine translation with open large language models at practical scale: An empirical study . In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human ...

work page doi:10.18653/v1/2025.naacl-long.280 2025
[8]

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Michael Auli, and Armand Joulin. 2021. http://jmlr.org/papers/v22/20-1307.html Beyond english-centric multilingual machine translation . Journa...

2021
[9]

Mara Finkelstein, Isaac Caswell, Tobias Domhan, Jan-Thorsten Peter, Juraj Juraska, Parker Riley, Daniel Deutsch, Geza Kovacs, Cole Dilanni, Colin Cherry, Eleftheria Briakou, Elizabeth Nielsen, Jiaming Luo, Kat Black, Ryan Mullins, Sweta Agrawal, Wenda Xu, Erin Kats, Stephane Jaskiewicz, and 2 others. 2026. Translate G emma T echnical R eport. arXiv prepri...

arXiv 2026
[10]

Markus Freitag, David Grangier, and Isaac Caswell. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.5 BLEU might be guilty but references are not innocent . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 61--71, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.emnlp-main.5 2020
[11]

Markus Freitag, Nitika Mathur, Daniel Deutsch, Chi-Kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian Thompson, Frederic Blain, Tom Kocmi, Jiayi Wang, David Ifeoluwa Adelani, Marianna Buchicchio, Chrysoula Zerva, and Alon Lavie. 2024. https://doi.org/10.18653/v1/2024.wmt-1.2 Are LLM s breaking MT metrics? results of the WMT 24 metrics shared task . In Proc...

work page doi:10.18653/v1/2024.wmt-1.2 2024
[12]

Leo Gao, John Schulman, and Jacob Hilton. 2023. https://proceedings.mlr.press/v202/gao23h.html Scaling L aws for R eward M odel O veroptimization . In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 10835--10866. PMLR

2023
[13]

Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vladimir Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz. 2024. https://doi.org/10.18653/v1/2024.emnlp-industry.36 Arcee ' s M erge K it: A toolkit for merging large language models . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: ...

work page doi:10.18653/v1/2024.emnlp-industry.36 2024
[14]

C. A. E. Goodhart. 1984. https://doi.org/10.1007/978-1-349-17295-5_4 Problems of Monetary Management: The UK Experience , pages 91--121. Macmillan Education UK, London

work page doi:10.1007/978-1-349-17295-5_4 1984
[15]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, and 175 others. 2025. Deep S eek- R 1: I ncentivizing R easoning C apability in LLM s via R einforcement L earning. arXiv preprint arX...

Pith/arXiv arXiv 2025
[16]

Masanori Hirano and Kentaro Imashiro. 2025. 金融分野に特化した複数ターン日本語生成ベンチマークの構築 [translate: Development of a multi-turn japanese generation benchmark specialized in the financial sector]. In Proceedings of the Thirty-first Annual Meeting of the Association for Natural Language Processing

2025
[17]

Masanori Hirano, Kentaro Imashiro, Kento Nozawa, and Kaizaburo Nakahachi. 2025. https://doi.org/10.51094/jxiv.1461 Plamo translate: 翻訳特化大規模言語モデルの開発 [translate: Plamo translate: Development of a large-scale language model specialized for translation] . In Proceedings of the Thirty-first Annual Meeting of the Association for Natural Language Processing

work page doi:10.51094/jxiv.1461 2025
[18]

Hieu Hoang and Philipp Koehn. 2008. https://aclanthology.org/W08-0510/ Design of the M oses decoder for statistical machine translation . In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 58--65, Columbus, Ohio. Association for Computational Linguistics

2008
[19]

Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, and Zhipeng Wang. 2025. https://openreview.net/forum?id=36SjAIT42G Liger-kernel: Efficient triton kernels for LLM training . In Championing Open-source DEvelopment in ML Workshop @ ICML25

2025
[20]

Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Mitsuki Sakamoto, Ryota Mitsuhashi, and Eiji Uchibe. 2025. https://doi.org/10.48550/arXiv.2509.22047 M O-GRPO : M itigating R eward H acking of G roup R elative P olicy O ptimization on M ulti- O bjective P roblems . arXiv preprint arXiv:2509.22047

work page doi:10.48550/arxiv.2509.22047 2025
[21]

Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein

Neel Jain, Ping yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. 2024. https://openreview.net/forum?id=0bMmZ3fkCk NEFT une: Noisy embeddings improve instruction finetuning . In The Twelfth International Con...

2024
[22]

Junfeng Jiang, Jiahao Huang, and Akiko Aizawa. 2025. https://aclanthology.org/2025.coling-main.395/ JM ed B ench: A benchmark for evaluating J apanese biomedical large language models . In Proceedings of the 31st International Conference on Computational Linguistics, pages 5918--5935, Abu Dhabi, UAE. Association for Computational Linguistics

2025
[23]

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. https://doi.org/10.18653/v1/2020.acl-main.560 The state and fate of linguistic diversity and inclusion in the NLP world . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282--6293, Online. Association for Computational...

work page doi:10.18653/v1/2020.acl-main.560 2020
[24]

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. https://aclanthology.org/E17-2068/ Bag of tricks for efficient text classification . In Proceedings of the 15th Conference of the E uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , pages 427--431, Valencia, Spain. Association for Computationa...

2017
[25]

Juraj Juraska, Daniel Deutsch, Mara Finkelstein, and Markus Freitag. 2024. https://doi.org/10.18653/v1/2024.wmt-1.35 M etric X -24: The G oogle submission to the WMT 2024 metrics shared task . In Proceedings of the Ninth Conference on Machine Translation, pages 492--504, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.wmt-1.35 2024
[26]

Tom Kocmi, Ekaterina Artemova, Eleftherios Avramidis, Rachel Bawden, Ond r ej Bojar, Konstantin Dranch, Anton Dvorkovich, Sergey Dukanov, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Howard Lakougna, Jessica Lundin, Christof Monz, Kenton Murray, and 10 others. 2025. https://doi.org/10.18653...

work page doi:10.18653/v1/2025.wmt-1.22 2025
[27]

Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ond r ej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Benjamin Marie, Christof Monz, Kenton Murray, Masaaki Nagata, Martin Popel, Maja Popovi \'c , and 3 others. 2024. https://doi.org/10.18653/v1/...

work page doi:10.18653/v1/2024.wmt-1.1 2024
[28]

Tom Kocmi and Christian Federmann. 2023 a . https://doi.org/10.18653/v1/2023.wmt-1.64 GEMBA - MQM : Detecting translation quality error spans with GPT -4 . In Proceedings of the Eighth Conference on Machine Translation, pages 768--775, Singapore. Association for Computational Linguistics

work page doi:10.18653/v1/2023.wmt-1.64 2023
[29]

Tom Kocmi and Christian Federmann. 2023 b . https://aclanthology.org/2023.eamt-1.19/ Large language models are state-of-the-art evaluators of translation quality . In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 193--203, Tampere, Finland. European Association for Machine Translation

2023
[30]

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2022. https://doi.org/10.18653/v1/2022.acl-long.577 Deduplicating training data makes language models better . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424...

work page doi:10.18653/v1/2022.acl-long.577 2022
[31]

Youyuan Lin, Masaaki Nagata, and Chenhui Chu. 2024. Post-editing with error annotation for machine translation: Dataset construction using gpt-4. In Proceedings of the Thirtieth Annual Meeting of the Association for Natural Language Processing

2024
[32]

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. 2025. https://doi.org/10.48550/arXiv.2503.20783 Understanding R 1- Z ero-like training: A critical perspective . In Conference on Language Modeling (COLM)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.20783 2025
[33]

Marco Lui and Timothy Baldwin. 2012. https://aclanthology.org/P12-3005/ langid.py: An off-the-shelf language identification tool . In Proceedings of the ACL 2012 System Demonstrations , pages 25--30, Jeju Island, Korea. Association for Computational Linguistics

2012
[34]

Makoto Morishita, Katsuki Chousa, Jun Suzuki, and Masaaki Nagata. 2022. https://aclanthology.org/2022.lrec-1.721/ JP ara C rawl v3.0: A large-scale E nglish- J apanese parallel corpus . In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6704--6710, Marseille, France. European Language Resources Association

2022
[35]

Makoto Morishita, Jun Suzuki, and Masaaki Nagata. 2020. https://aclanthology.org/2020.lrec-1.443/ JP ara C rawl: A large scale web-based E nglish- J apanese parallel corpus . In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3603--3609, Marseille, France. European Language Resources Association

2020
[36]

Toshiaki Nakazawa, Takashi Tsunakawa, Isao Goto, Kazuhiro Kasada, Katsuhito Sudoh, Shoichi Okuyama, Takashi Ieda, and Masaaki Nagata. 2025. https://doi.org/10.18653/v1/2025.wat-1.1 Findings of the first patent claims translation task at WAT 2025 . In Proceedings of the Twelfth Workshop on Asian Translation (WAT 2025), pages 1--15, Mumbai, India. Associati...

work page doi:10.18653/v1/2025.wat-1.1 2025
[37]

Dayy \'a n O ' Brien, Bhavitvya Malik, Ona de Gibert, Pinzhen Chen, Barry Haddow, and J \"o rg Tiedemann. 2025. https://doi.org/10.18653/v1/2025.wmt-1.17 D oc HPLT : A massively multilingual document-level translation dataset . In Proceedings of the Tenth Conference on Machine Translation, pages 286--300, Suzhou, China. Association for Computational Linguistics

work page doi:10.18653/v1/2025.wmt-1.17 2025
[38]

Andrew Or, Apurva Jain, Daniel Vega-Myhre, Jesse Cai, Charles David Hernandez, Zhenrui Zheng, Driss Guessous, Vasiliy Kuznetsov, Christian Puhrsch, Mark Saroufim, Supriya Rao, Thien Tran, and Aleksandar Samardžić. 2025. Torchao: Pytorch-native training-to-serving model optimization. arXiv preprint arXiv:2507.16099

arXiv 2025
[39]

Alexander Pan, Kush Bhatia, and Jacob Steinhardt. 2022. https://openreview.net/forum?id=JYtwGwIL7ye The E ffects of R eward M isspecification: M apping and M itigating M isaligned M odels . In International Conference on Learning Representations

2022
[40]

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. https://doi.org/10.3115/1073083.1073135 B leu: a method for automatic evaluation of machine translation . In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311--318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics

work page doi:10.3115/1073083.1073135 2002
[41]

Guilherme Penedo, Hynek Kydl \' c ek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. 2024. https://openreview.net/forum?id=n6SCkn2QaG The fineweb datasets: Decanting the web for the finest text data at scale . In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchma...

2024
[42]

Guerreiro, Ricardo Rei, and Andr \'e F

Jos \'e Pombal, Nuno M. Guerreiro, Ricardo Rei, and Andr \'e F. T. Martins. 2025 a . https://doi.org/10.1162/tacl.a.37 Adding chocolate to mint: Mitigating metric interference in machine translation . Transactions of the Association for Computational Linguistics, 13:1319--1339

work page doi:10.1162/tacl.a.37 2025
[43]

Jos \'e Pombal, Dongkeun Yoon, Patrick Fernandes, Ian Wu, Seungone Kim, Ricardo Rei, Graham Neubig, and Andre Martins. 2025 b . M-prometheus: A suite of open multilingual LLM judges. In Second Conference on Language Modeling

2025
[44]

Matt Post. 2018. https://doi.org/10.18653/v1/W18-6319 A call for clarity in reporting BLEU scores . In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186--191, Brussels, Belgium. Association for Computational Linguistics

work page doi:10.18653/v1/w18-6319 2018
[45]

Ricardo Rei, Jos \'e G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and Andr \'e F. T. Martins. 2022. https://doi.org/10.18653/v1/2022.wmt-1.52 COMET -22: Unbabel- IST 2022 submission for the metrics shared task . In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578--5...

work page doi:10.18653/v1/2022.wmt-1.52 2022
[46]

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. https://doi.org/10.18653/v1/2020.wmt-1.101 Unbabel ' s participation in the WMT 20 metrics shared task . In Proceedings of the Fifth Conference on Machine Translation, pages 911--920, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.wmt-1.101 2020
[47]

Mat \= i ss Rikters, Ryokan Ri, Tong Li, and Toshiaki Nakazawa. 2019. https://doi.org/10.18653/v1/D19-5204 Designing the business conversation corpus . In Proceedings of the 6th Workshop on Asian Translation, pages 54--61, Hong Kong, China. Association for Computational Linguistics

work page doi:10.18653/v1/d19-5204 2019
[48]

Paul R \"o ttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. 2024. https://doi.org/10.18653/v1/2024.naacl-long.301 XST est: A test suite for identifying exaggerated safety behaviours in large language models . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Ling...

work page doi:10.18653/v1/2024.naacl-long.301 2024
[49]

Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, and Angela Fan. 2021. https://doi.org/10.18653/v1/2021.acl-long.507 CCM atrix: Mining billions of high-quality parallel sentences on the web . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference ...

work page doi:10.18653/v1/2021.acl-long.507 2021
[50]

NLLB Team. 2024. Scaling neural machine translation to 200 languages. Nature, 630(8018):841--846

2024
[51]

Seiko Yamagishi, Shunsuke Shindo, and Yusuke Miyao. 2025. 大規模言語モデルの法廷通訳への導入可能性の検証 [translate: An investigation into the feasibility of introducing large language models into court interpretation]. In Proceedings of the Thirty-first Annual Meeting of the Association for Natural Language Processing

2025
[52]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

Pith/arXiv arXiv 2025
[53]

Mao Zheng, Zheng Li, Tao Chen, Mingyang Song, and Di Wang. 2025 a . Hy-mt1.5 technical report. arXiv preprint arXiv:2512.24092

arXiv 2025
[54]

Mao Zheng, Zheng Li, Yang Du, Bingxin Qu, and Mingyang Song. 2025 b . https://doi.org/10.18653/v1/2025.wmt-1.36 Shy-hunyuan- MT at WMT 25 general machine translation shared task . In Proceedings of the Tenth Conference on Machine Translation, pages 607--613, Suzhou, China. Association for Computational Linguistics

work page doi:10.18653/v1/2025.wmt-1.36 2025
[55]

Mao Zheng, Zheng Li, Bingxin Qu, Mingyang Song, Yang Du, Mingrui Sun, and Di Wang. 2025 c . Hunyuan-mt technical report. arXiv preprint arXiv:2509.05209

arXiv 2025

[1] [1]

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, and 1 others. 2025. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925

Pith/arXiv arXiv 2025

[2] [2]

Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. https://doi.org/10.18653/v1/N19-1388 Massively multilingual neural machine translation . In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages 3874--3884, Minneapolis, M...

work page doi:10.18653/v1/n19-1388 2019

[3] [3]

Mikko Aulamo, Nikolay Bogoychev, Shaoxiong Ji, Graeme Nail, Gema Ram \'i rez-S \'a nchez, J \"o rg Tiedemann, Jelmer van der Linde, and Jaume Zaragoza. 2023. https://aclanthology.org/2023.eamt-1.61/ HPLT : High performance language technologies . In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 517--5...

2023

[4] [4]

Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors. 1993. https://www.jbe-platform.com/content/books/9789027285874 Text and Technology . John Benjamins

arXiv 1993

[5] [5]

Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Ba \ n \'o n, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Haji c , Jind r ich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kyt \"o niemi, Veronika Laippala, Petter M hlum, Bhavitvya Malik, and 16 others. 2025. https://doi.org/1...

work page doi:10.18653/v1/2025.acl-long.854 2025

[6] [6]

Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. 2025 a . https://openreview.net/forum?id=CdFnEu0JZV OR -bench: An over-refusal benchmark for large language models . In Forty-second International Conference on Machine Learning

2025

[7] [7]

Menglong Cui, Pengzhi Gao, Wei Liu, Jian Luan, and Bin Wang. 2025 b . https://doi.org/10.18653/v1/2025.naacl-long.280 Multilingual machine translation with open large language models at practical scale: An empirical study . In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human ...

work page doi:10.18653/v1/2025.naacl-long.280 2025

[8] [8]

Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Michael Auli, and Armand Joulin. 2021. http://jmlr.org/papers/v22/20-1307.html Beyond english-centric multilingual machine translation . Journa...

2021

[9] [9]

Mara Finkelstein, Isaac Caswell, Tobias Domhan, Jan-Thorsten Peter, Juraj Juraska, Parker Riley, Daniel Deutsch, Geza Kovacs, Cole Dilanni, Colin Cherry, Eleftheria Briakou, Elizabeth Nielsen, Jiaming Luo, Kat Black, Ryan Mullins, Sweta Agrawal, Wenda Xu, Erin Kats, Stephane Jaskiewicz, and 2 others. 2026. Translate G emma T echnical R eport. arXiv prepri...

arXiv 2026

[10] [10]

Markus Freitag, David Grangier, and Isaac Caswell. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.5 BLEU might be guilty but references are not innocent . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 61--71, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.emnlp-main.5 2020

[11] [11]

Markus Freitag, Nitika Mathur, Daniel Deutsch, Chi-Kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian Thompson, Frederic Blain, Tom Kocmi, Jiayi Wang, David Ifeoluwa Adelani, Marianna Buchicchio, Chrysoula Zerva, and Alon Lavie. 2024. https://doi.org/10.18653/v1/2024.wmt-1.2 Are LLM s breaking MT metrics? results of the WMT 24 metrics shared task . In Proc...

work page doi:10.18653/v1/2024.wmt-1.2 2024

[12] [12]

Leo Gao, John Schulman, and Jacob Hilton. 2023. https://proceedings.mlr.press/v202/gao23h.html Scaling L aws for R eward M odel O veroptimization . In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 10835--10866. PMLR

2023

[13] [13]

Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vladimir Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz. 2024. https://doi.org/10.18653/v1/2024.emnlp-industry.36 Arcee ' s M erge K it: A toolkit for merging large language models . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: ...

work page doi:10.18653/v1/2024.emnlp-industry.36 2024

[14] [14]

C. A. E. Goodhart. 1984. https://doi.org/10.1007/978-1-349-17295-5_4 Problems of Monetary Management: The UK Experience , pages 91--121. Macmillan Education UK, London

work page doi:10.1007/978-1-349-17295-5_4 1984

[15] [15]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, and 175 others. 2025. Deep S eek- R 1: I ncentivizing R easoning C apability in LLM s via R einforcement L earning. arXiv preprint arX...

Pith/arXiv arXiv 2025

[16] [16]

Masanori Hirano and Kentaro Imashiro. 2025. 金融分野に特化した複数ターン日本語生成ベンチマークの構築 [translate: Development of a multi-turn japanese generation benchmark specialized in the financial sector]. In Proceedings of the Thirty-first Annual Meeting of the Association for Natural Language Processing

2025

[17] [17]

Masanori Hirano, Kentaro Imashiro, Kento Nozawa, and Kaizaburo Nakahachi. 2025. https://doi.org/10.51094/jxiv.1461 Plamo translate: 翻訳特化大規模言語モデルの開発 [translate: Plamo translate: Development of a large-scale language model specialized for translation] . In Proceedings of the Thirty-first Annual Meeting of the Association for Natural Language Processing

work page doi:10.51094/jxiv.1461 2025

[18] [18]

Hieu Hoang and Philipp Koehn. 2008. https://aclanthology.org/W08-0510/ Design of the M oses decoder for statistical machine translation . In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 58--65, Columbus, Ohio. Association for Computational Linguistics

2008

[19] [19]

Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, and Zhipeng Wang. 2025. https://openreview.net/forum?id=36SjAIT42G Liger-kernel: Efficient triton kernels for LLM training . In Championing Open-source DEvelopment in ML Workshop @ ICML25

2025

[20] [20]

Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Mitsuki Sakamoto, Ryota Mitsuhashi, and Eiji Uchibe. 2025. https://doi.org/10.48550/arXiv.2509.22047 M O-GRPO : M itigating R eward H acking of G roup R elative P olicy O ptimization on M ulti- O bjective P roblems . arXiv preprint arXiv:2509.22047

work page doi:10.48550/arxiv.2509.22047 2025

[21] [21]

Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein

Neel Jain, Ping yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. 2024. https://openreview.net/forum?id=0bMmZ3fkCk NEFT une: Noisy embeddings improve instruction finetuning . In The Twelfth International Con...

2024

[22] [22]

Junfeng Jiang, Jiahao Huang, and Akiko Aizawa. 2025. https://aclanthology.org/2025.coling-main.395/ JM ed B ench: A benchmark for evaluating J apanese biomedical large language models . In Proceedings of the 31st International Conference on Computational Linguistics, pages 5918--5935, Abu Dhabi, UAE. Association for Computational Linguistics

2025

[23] [23]

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. https://doi.org/10.18653/v1/2020.acl-main.560 The state and fate of linguistic diversity and inclusion in the NLP world . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282--6293, Online. Association for Computational...

work page doi:10.18653/v1/2020.acl-main.560 2020

[24] [24]

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. https://aclanthology.org/E17-2068/ Bag of tricks for efficient text classification . In Proceedings of the 15th Conference of the E uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , pages 427--431, Valencia, Spain. Association for Computationa...

2017

[25] [25]

Juraj Juraska, Daniel Deutsch, Mara Finkelstein, and Markus Freitag. 2024. https://doi.org/10.18653/v1/2024.wmt-1.35 M etric X -24: The G oogle submission to the WMT 2024 metrics shared task . In Proceedings of the Ninth Conference on Machine Translation, pages 492--504, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.wmt-1.35 2024

[26] [26]

Tom Kocmi, Ekaterina Artemova, Eleftherios Avramidis, Rachel Bawden, Ond r ej Bojar, Konstantin Dranch, Anton Dvorkovich, Sergey Dukanov, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Howard Lakougna, Jessica Lundin, Christof Monz, Kenton Murray, and 10 others. 2025. https://doi.org/10.18653...

work page doi:10.18653/v1/2025.wmt-1.22 2025

[27] [27]

Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ond r ej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Benjamin Marie, Christof Monz, Kenton Murray, Masaaki Nagata, Martin Popel, Maja Popovi \'c , and 3 others. 2024. https://doi.org/10.18653/v1/...

work page doi:10.18653/v1/2024.wmt-1.1 2024

[28] [28]

Tom Kocmi and Christian Federmann. 2023 a . https://doi.org/10.18653/v1/2023.wmt-1.64 GEMBA - MQM : Detecting translation quality error spans with GPT -4 . In Proceedings of the Eighth Conference on Machine Translation, pages 768--775, Singapore. Association for Computational Linguistics

work page doi:10.18653/v1/2023.wmt-1.64 2023

[29] [29]

Tom Kocmi and Christian Federmann. 2023 b . https://aclanthology.org/2023.eamt-1.19/ Large language models are state-of-the-art evaluators of translation quality . In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 193--203, Tampere, Finland. European Association for Machine Translation

2023

[30] [30]

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2022. https://doi.org/10.18653/v1/2022.acl-long.577 Deduplicating training data makes language models better . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424...

work page doi:10.18653/v1/2022.acl-long.577 2022

[31] [31]

Youyuan Lin, Masaaki Nagata, and Chenhui Chu. 2024. Post-editing with error annotation for machine translation: Dataset construction using gpt-4. In Proceedings of the Thirtieth Annual Meeting of the Association for Natural Language Processing

2024

[32] [32]

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. 2025. https://doi.org/10.48550/arXiv.2503.20783 Understanding R 1- Z ero-like training: A critical perspective . In Conference on Language Modeling (COLM)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.20783 2025

[33] [33]

Marco Lui and Timothy Baldwin. 2012. https://aclanthology.org/P12-3005/ langid.py: An off-the-shelf language identification tool . In Proceedings of the ACL 2012 System Demonstrations , pages 25--30, Jeju Island, Korea. Association for Computational Linguistics

2012

[34] [34]

Makoto Morishita, Katsuki Chousa, Jun Suzuki, and Masaaki Nagata. 2022. https://aclanthology.org/2022.lrec-1.721/ JP ara C rawl v3.0: A large-scale E nglish- J apanese parallel corpus . In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6704--6710, Marseille, France. European Language Resources Association

2022

[35] [35]

Makoto Morishita, Jun Suzuki, and Masaaki Nagata. 2020. https://aclanthology.org/2020.lrec-1.443/ JP ara C rawl: A large scale web-based E nglish- J apanese parallel corpus . In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3603--3609, Marseille, France. European Language Resources Association

2020

[36] [36]

Toshiaki Nakazawa, Takashi Tsunakawa, Isao Goto, Kazuhiro Kasada, Katsuhito Sudoh, Shoichi Okuyama, Takashi Ieda, and Masaaki Nagata. 2025. https://doi.org/10.18653/v1/2025.wat-1.1 Findings of the first patent claims translation task at WAT 2025 . In Proceedings of the Twelfth Workshop on Asian Translation (WAT 2025), pages 1--15, Mumbai, India. Associati...

work page doi:10.18653/v1/2025.wat-1.1 2025

[37] [37]

Dayy \'a n O ' Brien, Bhavitvya Malik, Ona de Gibert, Pinzhen Chen, Barry Haddow, and J \"o rg Tiedemann. 2025. https://doi.org/10.18653/v1/2025.wmt-1.17 D oc HPLT : A massively multilingual document-level translation dataset . In Proceedings of the Tenth Conference on Machine Translation, pages 286--300, Suzhou, China. Association for Computational Linguistics

work page doi:10.18653/v1/2025.wmt-1.17 2025

[38] [38]

Andrew Or, Apurva Jain, Daniel Vega-Myhre, Jesse Cai, Charles David Hernandez, Zhenrui Zheng, Driss Guessous, Vasiliy Kuznetsov, Christian Puhrsch, Mark Saroufim, Supriya Rao, Thien Tran, and Aleksandar Samardžić. 2025. Torchao: Pytorch-native training-to-serving model optimization. arXiv preprint arXiv:2507.16099

arXiv 2025

[39] [39]

Alexander Pan, Kush Bhatia, and Jacob Steinhardt. 2022. https://openreview.net/forum?id=JYtwGwIL7ye The E ffects of R eward M isspecification: M apping and M itigating M isaligned M odels . In International Conference on Learning Representations

2022

[40] [40]

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. https://doi.org/10.3115/1073083.1073135 B leu: a method for automatic evaluation of machine translation . In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311--318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics

work page doi:10.3115/1073083.1073135 2002

[41] [41]

Guilherme Penedo, Hynek Kydl \' c ek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. 2024. https://openreview.net/forum?id=n6SCkn2QaG The fineweb datasets: Decanting the web for the finest text data at scale . In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchma...

2024

[42] [42]

Guerreiro, Ricardo Rei, and Andr \'e F

Jos \'e Pombal, Nuno M. Guerreiro, Ricardo Rei, and Andr \'e F. T. Martins. 2025 a . https://doi.org/10.1162/tacl.a.37 Adding chocolate to mint: Mitigating metric interference in machine translation . Transactions of the Association for Computational Linguistics, 13:1319--1339

work page doi:10.1162/tacl.a.37 2025

[43] [43]

Jos \'e Pombal, Dongkeun Yoon, Patrick Fernandes, Ian Wu, Seungone Kim, Ricardo Rei, Graham Neubig, and Andre Martins. 2025 b . M-prometheus: A suite of open multilingual LLM judges. In Second Conference on Language Modeling

2025

[44] [44]

Matt Post. 2018. https://doi.org/10.18653/v1/W18-6319 A call for clarity in reporting BLEU scores . In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186--191, Brussels, Belgium. Association for Computational Linguistics

work page doi:10.18653/v1/w18-6319 2018

[45] [45]

Ricardo Rei, Jos \'e G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and Andr \'e F. T. Martins. 2022. https://doi.org/10.18653/v1/2022.wmt-1.52 COMET -22: Unbabel- IST 2022 submission for the metrics shared task . In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578--5...

work page doi:10.18653/v1/2022.wmt-1.52 2022

[46] [46]

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. https://doi.org/10.18653/v1/2020.wmt-1.101 Unbabel ' s participation in the WMT 20 metrics shared task . In Proceedings of the Fifth Conference on Machine Translation, pages 911--920, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.wmt-1.101 2020

[47] [47]

Mat \= i ss Rikters, Ryokan Ri, Tong Li, and Toshiaki Nakazawa. 2019. https://doi.org/10.18653/v1/D19-5204 Designing the business conversation corpus . In Proceedings of the 6th Workshop on Asian Translation, pages 54--61, Hong Kong, China. Association for Computational Linguistics

work page doi:10.18653/v1/d19-5204 2019

[48] [48]

Paul R \"o ttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. 2024. https://doi.org/10.18653/v1/2024.naacl-long.301 XST est: A test suite for identifying exaggerated safety behaviours in large language models . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Ling...

work page doi:10.18653/v1/2024.naacl-long.301 2024

[49] [49]

Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, and Angela Fan. 2021. https://doi.org/10.18653/v1/2021.acl-long.507 CCM atrix: Mining billions of high-quality parallel sentences on the web . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference ...

work page doi:10.18653/v1/2021.acl-long.507 2021

[50] [50]

NLLB Team. 2024. Scaling neural machine translation to 200 languages. Nature, 630(8018):841--846

2024

[51] [51]

Seiko Yamagishi, Shunsuke Shindo, and Yusuke Miyao. 2025. 大規模言語モデルの法廷通訳への導入可能性の検証 [translate: An investigation into the feasibility of introducing large language models into court interpretation]. In Proceedings of the Thirty-first Annual Meeting of the Association for Natural Language Processing

2025

[52] [52]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

Pith/arXiv arXiv 2025

[53] [53]

Mao Zheng, Zheng Li, Tao Chen, Mingyang Song, and Di Wang. 2025 a . Hy-mt1.5 technical report. arXiv preprint arXiv:2512.24092

arXiv 2025

[54] [54]

Mao Zheng, Zheng Li, Yang Du, Bingxin Qu, and Mingyang Song. 2025 b . https://doi.org/10.18653/v1/2025.wmt-1.36 Shy-hunyuan- MT at WMT 25 general machine translation shared task . In Proceedings of the Tenth Conference on Machine Translation, pages 607--613, Suzhou, China. Association for Computational Linguistics

work page doi:10.18653/v1/2025.wmt-1.36 2025

[55] [55]

Mao Zheng, Zheng Li, Bingxin Qu, Mingyang Song, Yang Du, Mingrui Sun, and Di Wang. 2025 c . Hunyuan-mt technical report. arXiv preprint arXiv:2509.05209

arXiv 2025