Why do Large Language Models Fail in Low-resource Translation? Unraveling the Token Dynamics of Large Language Models for Machine Translation
Pith reviewed 2026-05-11 02:16 UTC · model grok-4.3
The pith
LLMs under-activate target-language tokens when translating non-English-centric pairs, and this under-activation tracks with poorer quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across evaluations of 15 models on 22 language pairs, non-English-centric pairs produce lower COMET scores than English-centric ones. Token Activation Rate (TAR), the share of target-language tokens activated during generation, is correspondingly lower in those pairs, and lower TAR is strongly associated with the quality gap. Models with known higher exposure to a language in training data show higher TAR, which the authors take as validating the metric as a proxy for representation strength. Reasoning models respond to low TAR by generating more tokens overall, though the effect on final quality varies by model.
What carries the argument
Token Activation Rate (TAR), the proportion of language-specific tokens from the model's vocabulary that appear in the generated translation, used as a proxy for how well the target language is represented internally.
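The definition above can be sketched in a few lines. This is a minimal illustration, assuming TAR is the fraction of a language's vocabulary tokens that occur at least once in the generated output; the paper's exact formula and its procedure for assigning tokens to a language are not reproduced here, and the function name `token_activation_rate` and the token ids are hypothetical.

```python
from typing import Iterable, Set


def token_activation_rate(generated_ids: Iterable[int],
                          language_token_ids: Set[int]) -> float:
    """Fraction of a language's vocabulary tokens seen at least once in the output.

    Assumption: "activated" means "appears in the generated token stream".
    """
    if not language_token_ids:
        raise ValueError("language_token_ids must not be empty")
    activated = set(generated_ids) & language_token_ids
    return len(activated) / len(language_token_ids)


# Toy data: hypothetical token ids, not taken from any real tokenizer.
language_vocab = {10, 11, 12, 13, 14, 15, 16, 17}
outputs = [[10, 11, 99, 10], [12, 98, 11], [97, 13]]

# Pool all generated ids across translations before measuring coverage.
all_ids = [tok for seq in outputs for tok in seq]
tar = token_activation_rate(all_ids, language_vocab)
print(f"TAR = {tar:.2f}")
```

Under this reading, TAR depends on how much text is generated, so comparisons across language pairs would presumably need outputs of comparable size.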
If this is right
- Non-English-centric language pairs suffer consistent quality losses that scale with reduced TAR.
- Reasoning models increase generation length as a partial response to low TAR, with uneven effects on final output quality.
- TAR values match expected language coverage from training data, allowing prediction of which pairs will underperform.
- Token-level activation offers a diagnostic that distinguishes representation issues from other sources of translation failure.
Where Pith is reading between the lines
- If TAR is the operative mechanism, lightweight interventions that encourage activation of specific tokens could improve low-resource results without retraining the full model.
- The same under-activation pattern may limit performance in other generation tasks involving underrepresented languages.
- Different tokenizers or vocabulary designs could be compared by their TAR profiles to isolate whether the problem is model architecture or token inventory.
Load-bearing premise
That the measured association between low TAR and poor translation performance reflects a causal role for token utilization rather than a byproduct of other differences such as data quality or model size.
What would settle it
A controlled change that raises TAR for a low-performing language pair (for example by vocabulary intervention) yet leaves COMET scores unchanged, or that produces high TAR without corresponding quality gains, would falsify the explanatory claim.
Original abstract
Large Language Models (LLMs) have recently demonstrated strong performance in machine translation (MT). However, most prior work focuses on improving or benchmarking translation quality, offering limited insight into when and why LLM-based translation fails. In this work, we systematically analyze failure modes of LLMs in MT by evaluating 15 models, including four reasoning LLMs, across 22 language pairs (LPs) with varying resource levels. We find that non-English-centric LPs consistently yield lower COMET scores than English-centric pairs. To investigate the underlying causes, we introduce Token Activation Rate (TAR), a metric that captures how effectively a model utilizes language-specific tokens in its vocabulary during generation. We validate TAR as a proxy for language representation using models with known language distributions in the training data, and show that lower TAR is strongly associated with poorer translation performance. Furthermore, reasoning LLMs tend to generate more tokens when translating into low-TAR languages, suggesting a compensatory mechanism, although its impact on translation quality varies across models. Overall, our findings emphasize the importance of token-level dynamics in understanding MT performance of LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates 15 LLMs (including four reasoning models) on machine translation across 22 language pairs of varying resource levels. It reports that non-English-centric pairs yield lower COMET scores, introduces Token Activation Rate (TAR) as the fraction of language-specific tokens activated during generation, validates TAR as a proxy for language representation quality against known training-data distributions, and finds that lower TAR is strongly associated with poorer translation performance. Reasoning LLMs are observed to generate more tokens when translating into low-TAR languages, interpreted as a compensatory mechanism.
Significance. If the TAR metric can be shown to capture an independent mechanism after appropriate controls, the work would provide a useful token-level explanation for LLM MT failures in low-resource settings and could inform tokenizer design or training strategies. The multi-model evaluation and the attempt to link generation statistics to external training knowledge are constructive elements. The primarily correlational evidence currently limits the strength of the explanatory claims.
major comments (3)
- [§5] §5 (TAR-COMET association): the reported strong association between lower TAR and lower COMET scores across the 22 LPs does not include controls for resource level or data volume; low-resource and non-English-centric pairs are definitionally expected to exhibit both lower token coverage and lower quality, so the association may be a byproduct rather than evidence of a distinct token-dynamics mechanism. Partial correlation or stratification by resource level is needed to support the central claim.
- [TAR validation] TAR validation paragraph: while TAR is validated using models with known language distributions in training data, no quantitative statistics (correlation coefficients, p-values, or confidence intervals) or details on the exact identification of language-specific tokens are provided, weakening the claim that TAR serves as a reliable proxy independent of tokenizer artifacts.
- [Reasoning LLM analysis] Reasoning-LLM analysis: the observation that reasoning models generate more tokens for low-TAR languages is presented descriptively, but the manuscript does not quantify the effect on COMET scores, test its statistical significance, or compare it against non-reasoning baselines to isolate TAR's causal role.
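The partial-correlation control requested in the first major comment can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis: it uses the standard first-order partial correlation formula with a single control variable, and the variable names and toy values are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x: Sequence[float], y: Sequence[float],
                 z: Sequence[float]) -> float:
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy, standardized values (hypothetical): both TAR and COMET are driven
# by the same data-volume proxy plus independent noise.
data_volume = [1.0, 1.0, -1.0, -1.0]
tar = [2.0, 0.0, 0.0, -2.0]
comet = [2.0, 0.0, -2.0, 0.0]

print("raw r:", round(pearson(tar, comet), 3))
print("partial r:", round(partial_corr(tar, comet, data_volume), 3))
```

In this constructed case the raw correlation is sizable while the partial correlation is close to zero, which is the pattern the referee is worried about. A partial correlation that stays well away from zero would instead support TAR carrying information beyond data volume.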
minor comments (3)
- [Model description] A table listing all 15 models with their parameter counts, training details, and whether they are reasoning models would improve reproducibility and clarity.
- [Abstract] The abstract introduces TAR without a concise definition; moving a one-sentence definition to the abstract would help readers.
- [Figures] Figures showing TAR vs. COMET should include error bars or per-pair variability to allow assessment of robustness across the 22 LPs.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which have helped us clarify and strengthen the presentation of our findings on token dynamics in LLM-based machine translation. We address each major comment below and indicate the revisions made to the manuscript.
- Referee: [§5] §5 (TAR-COMET association): the reported strong association between lower TAR and lower COMET scores across the 22 LPs does not include controls for resource level or data volume; low-resource and non-English-centric pairs are definitionally expected to exhibit both lower token coverage and lower quality, so the association may be a byproduct rather than evidence of a distinct token-dynamics mechanism. Partial correlation or stratification by resource level is needed to support the central claim.
  Authors: We agree that resource level and data volume represent important potential confounders, as non-English-centric pairs frequently align with lower-resource settings. In the revised manuscript, we have added stratification of the 22 language pairs into high-, medium-, and low-resource categories based on established MT benchmarks, along with partial correlation analysis between TAR and COMET scores while controlling for estimated training data volume (using publicly available corpus size proxies). The association between lower TAR and lower COMET scores remains statistically significant both within resource strata and after partialling out data volume effects (updated results and figures now appear in §5). This supports TAR capturing token-utilization dynamics that are not fully reducible to data availability.
  Revision: yes
- Referee: [TAR validation] TAR validation paragraph: while TAR is validated using models with known language distributions in training data, no quantitative statistics (correlation coefficients, p-values, or confidence intervals) or details on the exact identification of language-specific tokens are provided, weakening the claim that TAR serves as a reliable proxy independent of tokenizer artifacts.
  Authors: We acknowledge that the original validation relied primarily on qualitative alignment with known training distributions. The revised manuscript now includes quantitative statistics: Pearson correlation coefficients (with p-values and 95% confidence intervals) between TAR values and log-scaled training data volumes for the evaluated models. We have also added explicit details on language-specific token identification, which combines tokenizer vocabulary metadata, Unicode script ranges, and frequency-based language assignment from the models' pretraining corpora (excluding cross-lingual shared tokens). These additions are incorporated into the TAR validation section and demonstrate that TAR tracks language representation beyond tokenizer-specific artifacts.
  Revision: yes
- Referee: [Reasoning LLM analysis] Reasoning-LLM analysis: the observation that reasoning models generate more tokens for low-TAR languages is presented descriptively, but the manuscript does not quantify the effect on COMET scores, test its statistical significance, or compare it against non-reasoning baselines to isolate TAR's causal role.
  Authors: We accept that the reasoning-LLM analysis was initially descriptive. The revision quantifies the mean increase in generated tokens for low-TAR versus high-TAR target languages within the four reasoning models, reports effect sizes, and includes t-tests confirming statistical significance of the difference. We further compare token counts and resulting COMET scores against the non-reasoning models in our 15-model evaluation. These expanded results are now presented with tables in the relevant section. However, fully isolating a causal role for TAR would require targeted interventions such as tokenizer retraining or controlled ablation studies, which lie outside the observational scope of this work; we have therefore framed the compensatory mechanism as correlational and model-dependent rather than causal.
  Revision: partial
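The effect-size and t-test comparison described in the last response can be sketched as below. The rebuttal does not say which t-test variant or effect-size measure is used, so Welch's t statistic and Cohen's d are assumptions here, and the token counts are made-up toy values rather than results from the paper.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled)


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic, which does not assume equal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Hypothetical generated-token counts for one reasoning model.
low_tar_lengths = [10, 12, 14]   # translations into low-TAR languages
high_tar_lengths = [4, 6, 8]     # translations into high-TAR languages

print("Cohen's d:", cohens_d(low_tar_lengths, high_tar_lengths))
print("Welch t:", round(welch_t(low_tar_lengths, high_tar_lengths), 3))
```

Turning the t statistic into a p-value requires the t distribution; in practice a library routine such as `scipy.stats.ttest_ind` with `equal_var=False` would handle that step.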
Circularity Check
No significant circularity: TAR validated externally and association presented as empirical observation
The paper defines TAR directly from observed generation statistics on language-specific tokens, validates it as a proxy by comparing against independently known training-data language distributions (an external benchmark), and reports the association between lower TAR and lower COMET scores as a separate empirical finding across 22 LPs. No equations reduce a claimed prediction to a fitted input by construction, no self-citation chains carry load-bearing premises, and no ansatz or uniqueness result is smuggled in. The chain is self-contained against external knowledge of training distributions and standard MT metrics.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: COMET scores accurately reflect translation quality differences across language pairs.
- Domain assumption: Token activation during generation is a direct indicator of how well the model represents the target language.
invented entities (1)
- Token Activation Rate (TAR): no independent evidence