arxiv: 2605.07533 · v1 · submitted 2026-05-08 · 💻 cs.CL

Why do Large Language Models Fail in Low-resource Translation? Unraveling the Token Dynamics of Large Language Models for Machine Translation

Pith reviewed 2026-05-11 02:16 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · machine translation · token activation rate · low-resource languages · non-English-centric pairs · COMET scores · reasoning models · vocabulary utilization

The pith

LLMs under-activate target-language tokens when translating non-English-centric pairs, and this under-activation tracks with poorer quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that large language models produce lower-quality translations for language pairs that are not centered on English, particularly those involving low-resource languages. It introduces Token Activation Rate (TAR), a metric that measures what share of a language's tokens in the model's vocabulary actually appear in the generated output. Lower activation rates align with weaker COMET scores across 15 models and 22 pairs, and the metric matches patterns expected from known differences in training data exposure. Reasoning models sometimes emit longer sequences when activation is low, yet this compensation does not reliably restore translation quality.

Core claim

Across evaluations of 15 models on 22 language pairs, non-English-centric pairs produce lower COMET scores than English-centric ones; Token Activation Rate, the share of target-language tokens activated during generation, is correspondingly lower in those pairs and correlates strongly with the quality gap. Models with known higher exposure to a language in training data show higher TAR, confirming the metric as a proxy for representation strength. Reasoning models respond to low TAR by generating more tokens overall, though the effect on final quality varies by model.

What carries the argument

Token Activation Rate (TAR), the proportion of language-specific tokens from the model's vocabulary that appear in the generated translation, used as a proxy for how well the target language is represented internally.
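The definition above can be made concrete with a minimal sketch. It assumes that a set of language-specific token ids has already been identified (how the paper builds that set is not specified here), and the function name `token_activation_rate` and the toy ids are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Set


def token_activation_rate(
    generated_token_ids: Iterable[Iterable[int]],
    language_token_ids: Set[int],
) -> float:
    """Share of a language's vocabulary tokens that appear at least once in the outputs."""
    if not language_token_ids:
        raise ValueError("language_token_ids must not be empty")
    activated = set()
    for sequence in generated_token_ids:
        # Keep only tokens that belong to the target language's token set.
        activated.update(tok for tok in sequence if tok in language_token_ids)
    return len(activated) / len(language_token_ids)


# Toy example (assumption): ids 10-19 are tagged as target-language tokens.
target_vocab = set(range(10, 20))
outputs = [[1, 10, 11, 2], [11, 12, 3], [15, 99]]
tar = token_activation_rate(outputs, target_vocab)
print(f"TAR = {tar:.2f}")
```

In this reading, a token counts as activated once it appears in any generated translation, so repeated use of the same token does not raise the rate.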

If this is right

  • Non-English-centric language pairs suffer consistent quality losses that scale with reduced TAR.
  • Reasoning models increase generation length as a partial response to low TAR, with uneven effects on final output quality.
  • TAR values match expected language coverage from training data, allowing prediction of which pairs will underperform.
  • Token-level activation offers a diagnostic that distinguishes representation issues from other sources of translation failure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If TAR is the operative mechanism, lightweight interventions that encourage activation of specific tokens could improve low-resource results without retraining the full model.
  • The same under-activation pattern may limit performance in other generation tasks involving underrepresented languages.
  • Different tokenizers or vocabulary designs could be compared by their TAR profiles to isolate whether the problem is model architecture or token inventory.

Load-bearing premise

That the measured association between low TAR and poor translation performance reflects a causal role for token utilization rather than a byproduct of other differences such as data quality or model size.

What would settle it

A controlled change that raises TAR for a low-performing language pair (for example by vocabulary intervention) yet leaves COMET scores unchanged, or that produces high TAR without corresponding quality gains, would falsify the explanatory claim.

Figures

Figures reproduced from arXiv: 2605.07533 by Shenbin Qian, Yves Scherrer.

Figure 1
Figure 1: COMET scores of translations for 22 language pairs using Prompt 2 under zero-shot setting.
… DeepSeek-V3.2-Exp-671B-reasoner, respectively) (DeepSeek-AI, 2025). Llama-3.2-3B-Instruct (Meta AI, 2024) and gemma-3-27b-it (Gemma Team et al., 2025) were selected as decoder-only dense IT models, while t5gemma-xl-xl-prefixlm-it (Zhang et al., 2025) serves as a representative of recent encoder-decoder IT models. Sin…
Figure 2
Figure 2: TAR for 13 different languages and 14 models (excluding Google Translate).
First, we observe that non-English-centric LPs have substantially lower average COMET scores than English-centric pairs, with greater performance variability across these LPs. This reflects the current state of the art in MT, namely the English-centricity of language resources. The figure also shows clear performance degradation f…
Figure 3
Figure 3: TAR of the vocabulary of Qwen3-4B-Thinking-2507 per language pair in the source (X axis) and target (Y axis) language against the average number of reasoning tokens.
…satory mechanism. Furthermore, we also explore whether generating more reasoning tokens at test time would improve translation quality. Reasoning Tokens vs TAR
Figure 4
Figure 4: The average number of reasoning tokens from Qwen3-4B-Thinking-2507 vs the increase of COMET scores (∆COMET) compared to its IT model Qwen3-4B-Instruct-2507.
…cients between the average number of reasoning tokens and ∆COMET and ∆BLEU. The table reveals that their correlations are model-dependent. For Qwen models, more reasoning tokens exhibit a strong positive correlation with COMET score improvements, indi…
Original abstract

Large Language Models (LLMs) have recently demonstrated strong performance in machine translation (MT). However, most prior work focuses on improving or benchmarking translation quality, offering limited insight into when and why LLM-based translation fails. In this work, we systematically analyze failure modes of LLMs in MT by evaluating 15 models, including four reasoning LLMs, across 22 language pairs (LPs) with varying resource levels. We find that non-English-centric LPs consistently yield lower COMET scores than English-centric pairs. To investigate the underlying causes, we introduce Token Activation Rate (TAR), a metric that captures how effectively a model utilizes language-specific tokens in its vocabulary during generation. We validate TAR as a proxy for language representation using models with known language distributions in the training data, and show that lower TAR is strongly associated with poorer translation performance. Furthermore, reasoning LLMs tend to generate more tokens when translating into low-TAR languages, suggesting a compensatory mechanism, although its impact on translation quality varies across models. Overall, our findings emphasize the importance of token-level dynamics in understanding MT performance of LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript evaluates 15 LLMs (including four reasoning models) on machine translation across 22 language pairs of varying resource levels. It reports that non-English-centric pairs yield lower COMET scores, introduces Token Activation Rate (TAR) as the fraction of language-specific tokens activated during generation, validates TAR as a proxy for language representation quality against known training-data distributions, and finds that lower TAR is strongly associated with poorer translation performance. Reasoning LLMs are observed to generate more tokens when translating into low-TAR languages, interpreted as a compensatory mechanism.

Significance. If the TAR metric can be shown to capture an independent mechanism after appropriate controls, the work would provide a useful token-level explanation for LLM MT failures in low-resource settings and could inform tokenizer design or training strategies. The multi-model evaluation and the attempt to link generation statistics to external training knowledge are constructive elements. The primarily correlational evidence currently limits the strength of the explanatory claims.

major comments (3)
  1. [§5] §5 (TAR-COMET association): the reported strong association between low TAR and low COMET scores across the 22 LPs does not include controls for resource level or data volume; low-resource and non-English-centric pairs are definitionally expected to exhibit both lower token coverage and lower quality, so the association may be a byproduct rather than evidence of a distinct token-dynamics mechanism. Partial correlation or stratification by resource level is needed to support the central claim.
  2. [TAR validation] TAR validation paragraph: while TAR is validated using models with known language distributions in training data, no quantitative statistics (correlation coefficients, p-values, or confidence intervals) or details on the exact identification of language-specific tokens are provided, weakening the claim that TAR serves as a reliable proxy independent of tokenizer artifacts.
  3. [Reasoning LLM analysis] Reasoning-LLM analysis: the observation that reasoning models generate more tokens for low-TAR languages is presented descriptively, but the manuscript does not quantify the effect on COMET scores, test its statistical significance, or compare it against non-reasoning baselines to isolate TAR's causal role.
minor comments (3)
  1. [Model description] A table listing all 15 models with their parameter counts, training details, and whether they are reasoning models would improve reproducibility and clarity.
  2. [Abstract] The abstract introduces TAR without a concise definition; moving a one-sentence definition to the abstract would help readers.
  3. [Figures] Figures showing TAR vs. COMET should include error bars or per-pair variability to allow assessment of robustness across the 22 LPs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which have helped us clarify and strengthen the presentation of our findings on token dynamics in LLM-based machine translation. We address each major comment below and indicate the revisions made to the manuscript.

Point-by-point responses
  1. Referee: [§5] §5 (TAR-COMET association): the reported strong association between low TAR and low COMET scores across the 22 LPs does not include controls for resource level or data volume; low-resource and non-English-centric pairs are definitionally expected to exhibit both lower token coverage and lower quality, so the association may be a byproduct rather than evidence of a distinct token-dynamics mechanism. Partial correlation or stratification by resource level is needed to support the central claim.

    Authors: We agree that resource level and data volume represent important potential confounders, as non-English-centric pairs frequently align with lower-resource settings. In the revised manuscript, we have added stratification of the 22 language pairs into high-, medium-, and low-resource categories based on established MT benchmarks, along with partial correlation analysis between TAR and COMET scores while controlling for estimated training data volume (using publicly available corpus size proxies). The association between low TAR and low COMET remains statistically significant both within resource strata and after partialling out data volume effects (updated results and figures now appear in §5). This supports TAR capturing token-utilization dynamics that are not fully reducible to data availability. revision: yes

  2. Referee: [TAR validation] TAR validation paragraph: while TAR is validated using models with known language distributions in training data, no quantitative statistics (correlation coefficients, p-values, or confidence intervals) or details on the exact identification of language-specific tokens are provided, weakening the claim that TAR serves as a reliable proxy independent of tokenizer artifacts.

    Authors: We acknowledge that the original validation relied primarily on qualitative alignment with known training distributions. The revised manuscript now includes quantitative statistics: Pearson correlation coefficients (with p-values and 95% confidence intervals) between TAR values and log-scaled training data volumes for the evaluated models. We have also added explicit details on language-specific token identification, which combines tokenizer vocabulary metadata, Unicode script ranges, and frequency-based language assignment from the models' pretraining corpora (excluding cross-lingual shared tokens). These additions are incorporated into the TAR validation section and demonstrate that TAR tracks language representation beyond tokenizer-specific artifacts. revision: yes

  3. Referee: [Reasoning LLM analysis] Reasoning-LLM analysis: the observation that reasoning models generate more tokens for low-TAR languages is presented descriptively, but the manuscript does not quantify the effect on COMET scores, test its statistical significance, or compare it against non-reasoning baselines to isolate TAR's causal role.

    Authors: We accept that the reasoning-LLM analysis was initially descriptive. The revision quantifies the mean increase in generated tokens for low-TAR versus high-TAR target languages within the four reasoning models, reports effect sizes, and includes t-tests confirming statistical significance of the difference. We further compare token counts and resulting COMET scores against the non-reasoning models in our 15-model evaluation. These expanded results are now presented with tables in the relevant section. However, fully isolating a causal role for TAR would require targeted interventions such as tokenizer retraining or controlled ablation studies, which lie outside the observational scope of this work; we have therefore framed the compensatory mechanism as correlational and model-dependent rather than causal. revision: partial
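Response 2 mentions Pearson coefficients reported with 95% confidence intervals. One common way to obtain such an interval is the Fisher z-transform, sketched below; whether the authors use this method is an assumption, and the input value r = 0.6 is hypothetical rather than a reported result.

```python
import math
from statistics import NormalDist


def fisher_ci(r: float, n: int, level: float = 0.95) -> tuple:
    """Confidence interval for a Pearson r from n paired observations."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("the Fisher approximation needs n > 3")
    z = math.atanh(r)                      # Fisher z-transform
    se = 1.0 / math.sqrt(n - 3)            # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Hypothetical inputs: r = 0.6 over 22 points (22 matches the number of
# language pairs in the paper; the 0.6 is NOT a reported result).
low, high = fisher_ci(0.6, 22)
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```

With only 22 language pairs the interval is wide, which is one reason the referee's request for confidence intervals matters when judging how strong the reported associations are.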

Circularity Check

0 steps flagged

No significant circularity: TAR validated externally and association presented as empirical observation

full rationale

The paper defines TAR directly from observed generation statistics on language-specific tokens, validates it as a proxy by comparing against independently known training-data language distributions (external benchmark), and reports the association between lower TAR and lower COMET scores as a separate empirical finding across 22 LPs. No equations reduce a claimed prediction to a fitted input by construction, no self-citation chains carry load-bearing premises, and no ansatz or uniqueness result is smuggled in. The chain is self-contained against external knowledge of training distributions and standard MT metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that COMET scores are a reliable proxy for translation quality and that token activation rate meaningfully reflects language representation learned during pretraining. No free parameters are introduced. TAR itself is a newly defined metric rather than a postulated physical entity.

axioms (2)
  • domain assumption: COMET scores accurately reflect translation quality differences across language pairs
    Invoked when claiming lower COMET for non-English-centric pairs indicates failure
  • domain assumption: Token activation during generation is a direct indicator of how well the model represents the target language
    Used to interpret TAR as a proxy for language representation
invented entities (1)
  • Token Activation Rate (TAR) (no independent evidence)
    purpose: Metric to quantify effective use of language-specific tokens in generation
    Newly introduced quantity; no independent evidence beyond the paper's own validation step

pith-pipeline@v0.9.0 · 5498 in / 1423 out tokens · 31966 ms · 2026-05-11T02:16:40.562499+00:00 · methodology

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages

  1. [1]

    Ahn, Janice, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. 2024. Large language models for mathematical reasoning: Progresses and challenges. In Falk, Neele, Sara Papi, and Mike Zhang, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop , pages 225--2...

  2. [2]

    Apertus Project, Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, Vinko Sabolčec, Yixuan Xu, Michael Aerni, Badr AlKhamis...

  3. [3]

    Barrault, Loïc, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2...

  4. [4]

    BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova...

  5. [5]

    Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics , 5:135--146

  6. [6]

    Castaldo, Antonio and Johanna Monti. 2024. Prompting large language models for idiomatic translation. In Vanroy, Bram, Marie-Aude Lefer, Lieve Macken, and Paola Ruffo, editors, Proceedings of the 1st Workshop on Creative-text Translation and Technology , pages 32--39, Sheffield, United Kingdom, June. European Association for Machine Translation

  7. [7]

    Caswell, Isaac. 2024. 110 new languages are coming to Google Translate . Accessed on 10, Dec 2025

  8. [8]

    Chizhov, Pavel, Catherine Arnett, Elizaveta Korotkova, and Ivan P. Yamshchikov. 2024. BPE gets picky: Efficient vocabulary refinement during tokenizer training. In Al-Onaizan, Yaser, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 16587--16604, Miami, Florida, USA, No...

  9. [9]

    Court, Sara and Micha Elsner. 2024. Shortcomings of LLMs for low-resource translation: Retrieval and understanding are both the problem. In Haddow, Barry, Tom Kocmi, Philipp Koehn, and Christof Monz, editors, Proceedings of the Ninth Conference on Machine Translation, pages 1332--1354, Miami, Florida, USA, November. Association for Computational Linguistics

  10. [10]

    Dang, John, Shivalika Singh, Daniel D'souza, Arash Ahmadian, Alejandro Salamanca, Madeline Smith, Aidan Peppin, Sungjin Hong, Manoj Govindassamy, Terrence Zhao, Sandra Kublik, Meor Amer, Viraat Aryabumi, Jon Ander Campos, Yi-Chern Tan, Tom Kocmi, Florian Strub, Nathan Grinsztajn, Yannis Flet-Berliac, Acyr Locatelli, Hangyu Lin, Dwarak Talupuru, Bharat Ven...

  11. [11]

    DeepSeek-AI. 2025. Deepseek-v3.2-exp: Boosting long-context efficiency with deepseek sparse attention. Accessed on 08, Dec 2025

  12. [12]

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenea...

  13. [13]

    Groeneveld, Dirk, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Cr...

  14. [14]

    Guo, Daya, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z F Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, E...

  15. [15]

    Guzmán, Francisco, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato. 2019. The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Inui, Kentaro, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Confe...

  16. [16]

    He, Sui. 2024. Prompting ChatGPT for translation: A comparative analysis of translation brief and persona prompts. In Scarton, Carolina, Charlotte Prescott, Chris Bayliss, Chris Oakley, Joanna Wright, Stuart Wrigley, Xingyi Song, Edward Gow-Smith, Rachel Bawden, Víctor M Sánchez-Cartagena, Patrick Cadwell, Ekaterina Lapshinova-Koltunski, Vera Ca...

  17. [17]

    Hendrycks, Dan, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring Massive Multitask Language Understanding . In International Conference on Learning Representations

  18. [18]

    Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint , March

  19. [19]

    Hirak, Vitalii, Jaap Jumelet, and Arianna Bisazza. 2026. Assessing the Impact of Typological Features on Multilingual Machine Translation in the Age of Large Language Models. In Demberg, Vera, Kentaro Inui, and Lluís Marquez, editors, Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volu...

  20. [20]

    Huang, Xu, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, and Fei Yuan. 2025. BenchMAX: A comprehensive multilingual evaluation suite for large language models. In Christodoulopoulos, Christos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 16751--16774...

  21. [21]

    Jiang, Juyong, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2025. A survey on large language models for code generation. ACM Trans. Softw. Eng. Methodol. , July. Just Accepted

  22. [22]

    K2 Team , Zhengzhong Liu, Liping Tang, Linghao Jin, Haonan Li, Nikhil Ranjan, Desai Fan, Shaurya Rohatgi, Richard Fan, Omkar Pangarkar, Huijuan Wang, Zhoujun Cheng, Suqi Sun, Seungwook Han, Bowen Tan, Gurpreet Gosal, Xudong Han, Varad Pimpalkhute, Shibo Hao, Ming Shan Hee, Joel Hestness, Haolong Jia, Liqun Ma, Aaryamonvikram Singh, Daria Soboleva, Natalia...

  23. [23]

    Khiu, Eric, Hasti Toossi, David Anugraha, Jinyu Liu, Jiaxu Li, Juan Flores, Leandro Roman, A. Seza Doğruöz, and En-Shiun Lee. 2024. Predicting machine translation performance on low-resource languages: The role of domain similarity. In Graham, Yvette and Matthew Purver, editors, Findings of the Association for Computational Linguistics: EACL 2024, ...

  24. [24]

    Kocmi, Tom, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Benjamin Marie, Christof Monz, Kenton Murray, Masaaki Nagata, Martin Popel, Maja Popović, Mariya Shmatova, Steinthór Steingrímsso...

  25. [25]

    Kulkarni, Ajinkya. 2015. TED Multilingual Parallel Corpus . GitHub, 12. Accessed on 08, Dec 2025

  26. [26]

    Kwon, Woosuk, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles , SOSP '23, page 611–626, New York, NY, USA. Association for Computing Machinery

  27. [27]

    Lambert, Nathan, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi. ...

  28. [28]

    Lannelongue, Loïc, Jason Grealey, and Michael Inouye. 2021. Green algorithms: Quantifying the carbon footprint of computation. Advanced Science , 8(12):2100707

  29. [29]

    Littell, Patrick, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Lapata, Mirella, Phil Blunsom, and Alexander Koller, editors, Proceedings of the 15th Conference of the European Chapter of the Association for Computa...

  30. [30]

    Liu, Sinuo, Chenyang Lyu, Minghao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang, and Zifu Shang. 2025. New trends for modern machine translation with large reasoning models. arXiv preprint , March

  31. [31]

    Lundin, Jessica M, Ada Zhang, Nihal Karim, Hamza Louzan, Victor Wei, David Adelani, and Cody Carroll. 2025. The token tax: Systematic bias in multilingual tokenization. arXiv preprint , September

  32. [32]

    Martins, Pedro Henrique, Patrick Fernandes, João Alves, Nuno M Guerreiro, Ricardo Rei, Duarte M Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G C de Souza, Alexandra Birch, and André F T Martins. 2024. EuroLLM: Multilingual language models for Europe. arXiv preprint, September

  33. [33]

    Meta AI . 2024. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models . Accessed on 08, Dec 2025

  34. [34]

    NLLB Team, Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk ...

  35. [35]

    Olmo Team , Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, Sha...

  36. [36]

    OpenAI , Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich...

  37. [37]

    OpenAI . 2024. Introducing SWE-bench Verified . Accessed on 11, Dec 2025

  38. [38]

    Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Isabelle, Pierre, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311--318, Philadelphia, Pennsylvania, USA, July. Association for ...

  39. [39]

    Park, Jeonghyeok and Hai Zhao. 2019. Korean-to-Chinese Machine Translation using Chinese Character as Pivot Clue . arXiv preprint , November

  40. [40]

    Phan, Long, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes,...

  41. [41]

    Ploeger, Esther, Johannes Bjerva, Jörg Tiedemann, and Robert Oestling. 2025. A cross-lingual perspective on neural machine translation difficulty. In Haddow, Barry, Tom Kocmi, Philipp Koehn, and Christof Monz, editors, Proceedings of the Tenth Conference on Machine Translation, pages 340--354, Suzhou, China, November. Association for Computational Li...

  42. [42]

    Popović, Maja. 2017. chrF++: words helping character n-grams. In Bojar, Ondřej, Christian Buck, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, and Julia Kreutzer, editors, Proceedings of the Second Conference on Machine Translation, pages 612--618, Copenhagen, Denmark, Septe...

  43. [43]

    Post, Matt. 2018. A call for clarity in reporting BLEU scores. In Bojar, Ondřej, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, and Karin Verspoor, editors, P...

  44. [44]

    Provilkov, Ivan, Dmitrii Emelianenko, and Elena Voita. 2020. BPE-dropout: Simple and effective subword regularization. In Jurafsky, Dan, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882--1892, Online, July. Association for Computational Linguistics

  45. [45]

    Pu, Xiao, Mingqi Gao, and Xiaojun Wan. 2023. Summarization is (almost) dead. arXiv preprint , September

  46. [46]

    Qwen Team . 2024. Qwen2.5: A Party of Foundation Models! , 9. Accessed on 08, Dec 2025

  47. [47]

    Qwen Team . 2025. Qwen3: Think Deeper, Act Faster , 4. Accessed on 08, Dec 2025

  48. [48]

    Rei, Ricardo, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Webber, Bonnie, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685--2702, Online, November. Association for Computational Linguistics

  49. [49]

    Rei, Ricardo, José G. C. de Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André F. T. Martins. 2022a. COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Koehn, Philipp, Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian ...

  50. [50]

    Rei, Ricardo, Marcos Treviso, Nuno M. Guerreiro, Chrysoula Zerva, Ana C Farinha, Christine Maroti, José G. C. de Souza, Taisiya Glushkova, Duarte Alves, Luisa Coheur, Alon Lavie, and André F. T. Martins. 2022b. CometKiwi: IST-unbabel 2022 submission for the quality estimation shared task. In Koehn, Philipp, Loïc Barrault, Ondřej Bojar, Fet...

  51. [51]

    Rei, Ricardo, Nuno M Guerreiro, José Pombal, João Alves, Pedro Teixeirinha, Amin Farajian, and André F T Martins. 2025. Tower+: Bridging generality and translation specialization in multilingual LLMs. arXiv preprint, June

  52. [52]

    Romanou, Angelika, Negar Foroutan, Anna Sotnikova, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Zeming Chen, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou, Daniel Fernando Erazo Flo...

  53. [53]

    Rust, Phillip, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych. 2021. How good is your tokenizer? On the monolingual performance of multilingual language models. In Zong, Chengqing, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Inte...

  54. [54]

    Scherrer, Yves, Luka Nerima, Lorenza Russo, Maria Ivanova, and Eric Wehrli. 2014. SwissAdmin: A multilingual tagged parallel corpus of press releases. In Calzolari, Nicoletta, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth Internation...

  55. [55]

    Schwartz, Roy, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2020. Green AI. Commun. ACM, 63(12):54–63, November

  56. [56]

    Sindhujan, Archchana, Diptesh Kanojia, Constantin Orasan, and Shenbin Qian. 2025. When LLMs struggle: Reference-less translation evaluation for low-resource languages. In Hettiarachchi, Hansi, Tharindu Ranasinghe, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, and Lasitha Uyangodage, editors, Proceedings of the First Works...

  57. [57]

    Singh, Telem Joyson, Ranbir Singh Sanasam, and Priyankoo Sarmah. 2025. An information-theoretic approach to reducing fertility in LLMs for Manipuri machine translation. In Inui, Kentaro, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, and Dhirendra Pratap Singh, editors, Proceedings of ...

  58. [58]

    Song, Yewei, Lujun Li, Cedric Lothritz, Saad Ezzini, Lama Sleem, Niccolo Gentile, Radu State, Tegawendé F Bissyandé, and Jacques Klein. 2025. Is small language model the silver bullet to low-resource languages machine translation? arXiv preprint, August

  59. [59]

    Stewart, Craig, Ricardo Rei, Catarina Farinha, and Alon Lavie. 2020. COMET - deploying a new state-of-the-art MT evaluation metric in production. In Campbell, Janice, Dmitriy Genzel, Ben Huyck, and Patricia O'Neill-Brown, editors, Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track), pages...

  60. [60]

    Vilar, David, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, and George Foster. 2023. Prompting PaLM for translation: Assessing strategies and performance. In Rogers, Anna, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages ...

  61. [61]

    Wang, Alex, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: a stickier benchmark for general-purpose language understanding systems. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA. Curran Associates Inc

  62. [62]

    Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the...

  63. [63]

    Ye, Yongshi, Biao Fu, Chongxuan Huang, Yidong Chen, and Xiaodong Shi. 2025. How well do large reasoning models translate? A comprehensive evaluation for multi-domain machine translation. arXiv preprint, May

  64. [64]

    Yue, Xiang, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. 2025. MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark. In Che, Wanxiang, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd A...

  65. [65]

    Zhang, Biao, Barry Haddow, and Alexandra Birch. 2023. Prompting large language model for machine translation: a case study. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org

  66. [66]

    Zhang, Wenxuan, Yue Deng, Bing Liu, Sinno Pan, and Lidong Bing. 2024. Sentiment analysis in the era of large language models: A reality check. In Duh, Kevin, Helena Gomez, and Steven Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024, pages 3881–3906, Mexico City, Mexico, June. Association for Computational Linguistics

  67. [67]

    Zhang, Biao, Fedor Moiseev, Joshua Ainslie, Paul Suganthan, Min Ma, Surya Bhupatiraju, Fede Lebron, Orhan Firat, Armand Joulin, and Zhe Dong. 2025. Encoder-decoder Gemma: Improving the quality-efficiency trade-off via adaptation. arXiv preprint, April