
arxiv: 2605.24681 · v1 · pith:IMPSEHPVnew · submitted 2026-05-23 · 💻 cs.CL · cs.AI

Mix-MoE: Improving Multilingual Machine Translation of Large Language Models through Mixed MoEs

Pith reviewed 2026-06-30 13:20 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords mixture of experts · multilingual machine translation · large language models · parameter interference · fourier routing · post-pretraining

The pith

Mix-MoE splits LLM experts into monolingual and translation groups to reduce parameter interference in multilingual MT.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a way to adapt large language models for translation across many languages without the usual loss of their original language skills. It uses a mixed Mixture-of-Experts setup trained in two stages: the first reinforces monolingual knowledge, and the second adds translation knowledge through a separate group of experts. A routing mechanism that draws on Fourier Transform features helps the two groups coordinate. If the approach works as claimed, models can learn translation more effectively while keeping their general language understanding intact. It directly targets the parameter conflicts that arise when a model is trained on both monolingual and parallel data.

Core claim

Mix-MoE divides the MoE layers into LM Experts trained solely on monolingual corpora to preserve pre-trained knowledge and MT Experts trained on parallel corpora to learn translation abilities, using a Fourier Transform-enhanced routing mechanism for better expert interaction, leading to improved multilingual MT performance and reduced parameter interference.

What carries the argument

The split of MoE layers into Language Model Experts and Machine Translation Experts combined with Fourier Transform routing.
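
The summary above does not give the router's exact form, so the following is only a rough sketch of one way a gate could be "enhanced" with Fourier features: the magnitudes of a discrete Fourier transform of a token's hidden vector are concatenated to the vector before a linear gate, and a softmax is taken over the top-k experts. The function names, the four-expert layout, the concatenation step, and the top-k softmax are all assumptions made for illustration, not the authors' method.

```python
import cmath
import math
import random

def dft_magnitudes(x):
    """Magnitudes of the discrete Fourier transform of a real vector x."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]

def route(hidden, gate_weights, top_k=2):
    """Pick top_k experts from gate logits over [hidden ; |DFT(hidden)|].

    gate_weights has one row per expert; each row has length 2 * len(hidden).
    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    features = list(hidden) + dft_magnitudes(hidden)
    logits = [sum(w * f for w, f in zip(row, features)) for row in gate_weights]
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Hypothetical layout: experts 0-1 are LM Experts, experts 2-3 are MT Experts.
random.seed(0)
hidden = [random.gauss(0, 1) for _ in range(8)]
gate_weights = [[random.gauss(0, 0.1) for _ in range(16)] for _ in range(4)]
selection = route(hidden, gate_weights)
```

Because the gate in this sketch scores both groups jointly, nothing stops a token from being sent to either group; whether the real router behaves that way is exactly what the referee report below asks the authors to document.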

If this is right

  • Multilingual machine translation performance will increase beyond standard fine-tuning methods.
  • Parameter interference between monolingual and bilingual training will decrease.
  • Models will retain more of their original monolingual capabilities after translation training.
  • The Fourier routing will improve how experts are selected based on text structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could extend to other fine-tuning scenarios where general and specialized knowledge need separation.
  • Fixed expert groups might be replaced with learned groupings in future versions for more flexibility.
  • Similar expert splitting could apply to tasks like multilingual question answering or summarization.

Load-bearing premise

That dividing experts into fixed monolingual and translation groups with Fourier routing will prevent interference without causing new problems or wasting model capacity.

What would settle it

If a Mix-MoE model showed the same degradation in monolingual task performance after training on parallel data as a standard fine-tuned model, the claim would not hold.
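
As a sketch of how that comparison might be scored, the snippet below computes the relative drop in a monolingual benchmark score for each model and checks whether Mix-MoE loses less. The helper names and the `min_gap` threshold are hypothetical, and the numbers are made-up placeholders, not results from the paper.

```python
def relative_drop(score_before, score_after):
    """Fraction of the original score lost after parallel-data training."""
    return (score_before - score_after) / score_before

def claim_survives(baseline_drop, mix_moe_drop, min_gap=0.0):
    """True if Mix-MoE's drop is smaller than the baseline's by more than min_gap."""
    return baseline_drop - mix_moe_drop > min_gap

# Made-up placeholder scores, NOT results from the paper.
baseline_drop = relative_drop(score_before=60.0, score_after=54.0)
mix_moe_drop = relative_drop(score_before=60.0, score_after=54.0)
verdict = claim_survives(baseline_drop, mix_moe_drop)  # equal drops, so the claim fails
```

In practice `min_gap` would need to be set from run-to-run variance so that noise alone cannot pass the check.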

Figures

Figures reproduced from arXiv: 2605.24681 by Bo Li, Deyi Xiong, Shaolin Zhu, Tianyu Dong.

Figure 1: The architecture overview of Mix-MoE. We transform a dense LLM into a MoE model that includes two groups of experts: Language Model Experts
Figure 2: Selection Proportion for each LM and MT Expert across all 14 translation tasks. From left to right, they are Experts 0 to Experts 3.
Figure 3: Effect of Expert Number on BLEU for EN-ZH and ZH-EN Translation
Figure 4: BLEU improvement achieved on other language pairs using the Mix
Figure 5: BLEU improvement achieved on other language pairs using the Mix
Figure 6: Comparison of BLEU scores across 14 language directions for

Original abstract

Large Language Models (LLMs) have shown great promise in multilingual machine translation (MT), even with limited bilingual supervision. However, fine-tuning LLMs with parallel corpora presents major challenges, namely parameter interference. To address these issues, we propose Mix-MoE, a mixed Mixture-of-Experts framework designed to train LLMs for multilingual MT. Our framework operates in two distinct stages: (1) post-pretraining with MoE on monolingual corpora, and (2) post-pretraining with MoE on parallel corpora. Crucially, we divide the MoE layers into two specialized groups: Language Model Experts (LM Experts) and Machine Translation Experts (MT Experts). LM Experts are designed to capture and retain the monolingual knowledge learned by the pre-trained LLM. MT Experts, on the other hand, are specifically trained to acquire and store bilingual translation knowledge. Furthermore, to facilitate effective interaction between these specialized experts and leverage potential underlying structural patterns in text, we introduce a routing mechanism enhanced by Fourier Transform features derived from model representations. The experimental results demonstrate that Mix-MoE excels in multilingual MT, significantly outperforming existing baselines and showing notable progress in mitigating parameter interference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Mix-MoE, a two-stage post-pretraining Mixture-of-Experts framework for multilingual machine translation in LLMs. Stage 1 applies MoE to monolingual corpora with LM Experts; stage 2 applies MoE to parallel corpora with MT Experts. The MoE layers are partitioned into these two fixed groups, and a Fourier Transform feature-enhanced router is introduced to enable interaction while mitigating parameter interference. The central claim is that this design yields significant gains over baselines in multilingual MT and reduces interference.

Significance. If the empirical claims hold with proper controls and ablations, the work would offer a concrete architectural approach to separating monolingual and bilingual knowledge in LLM fine-tuning for MT, which could be useful for scaling multilingual capabilities without proportional increases in interference.

major comments (2)
  1. [Abstract / framework] Abstract and framework description: the central claim that the fixed LM/MT expert split plus Fourier routing reduces parameter interference rests on the unverified assumption that the router will not route monolingual tokens to MT Experts (or vice versa) at inference; no analysis, routing statistics, or ablation is supplied to show that cross-group routing is prevented or that capacity does not collapse.
  2. [Abstract] Abstract: the assertion of 'significantly outperforming existing baselines' and 'notable progress in mitigating parameter interference' is stated without any reported metrics, baselines, language pairs, model sizes, or statistical significance; the soundness of the headline result cannot be assessed from the supplied evidence.
minor comments (2)
  1. [Method] The precise mathematical form of the Fourier features added to the router and how they are combined with the standard gating function should be stated explicitly (e.g., as an equation).
  2. [Method] Clarify whether the two expert groups share any parameters or whether the router is trained jointly or in separate phases.
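
The routing statistics requested in major comment 1 could be gathered with something as simple as the sketch below, which tallies how often tokens from each data source are sent to each expert group. The log format, the `"mono"`/`"parallel"` source labels, and the four-expert mapping are hypothetical; the toy log is invented to exercise the function and is not data from the paper.

```python
from collections import Counter, defaultdict

def group_shares(routing_log, expert_group):
    """Share of routing decisions per expert group, broken down by token source.

    routing_log: iterable of (source, expert_index) pairs.
    expert_group: dict mapping expert_index to a group label such as "LM" or "MT".
    """
    counts = defaultdict(Counter)
    for source, expert in routing_log:
        counts[source][expert_group[expert]] += 1
    return {
        source: {group: c / sum(by_group.values()) for group, c in by_group.items()}
        for source, by_group in counts.items()
    }

# Hypothetical mapping and an invented toy log.
expert_group = {0: "LM", 1: "LM", 2: "MT", 3: "MT"}
toy_log = [
    ("mono", 0), ("mono", 1), ("mono", 2), ("mono", 0),
    ("parallel", 2), ("parallel", 3), ("parallel", 3), ("parallel", 1),
]
shares = group_shares(toy_log, expert_group)
```

A high share of `"mono"` tokens landing on MT Experts, or the reverse, would be the cross-group routing the referee is worried about.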

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. Below we respond point by point to the major comments and indicate planned revisions.

  1. Referee: [Abstract / framework] Abstract and framework description: the central claim that the fixed LM/MT expert split plus Fourier routing reduces parameter interference rests on the unverified assumption that the router will not route monolingual tokens to MT Experts (or vice versa) at inference; no analysis, routing statistics, or ablation is supplied to show that cross-group routing is prevented or that capacity does not collapse.

    Authors: We agree that the current manuscript lacks explicit routing statistics, cross-group routing analysis, or dedicated ablations to verify that the router largely respects the LM/MT expert partition at inference. The architectural intent is that the fixed grouping plus Fourier features limits harmful interference while still permitting interaction, but this remains an assumption without direct empirical support in the submitted version. We will add routing distribution statistics across expert groups and an ablation isolating cross-group routing in the revised manuscript. revision: yes

  2. Referee: [Abstract] Abstract: the assertion of 'significantly outperforming existing baselines' and 'notable progress in mitigating parameter interference' is stated without any reported metrics, baselines, language pairs, model sizes, or statistical significance; the soundness of the headline result cannot be assessed from the supplied evidence.

    Authors: The abstract is written as a high-level summary; all quantitative results, baselines, language pairs, model sizes, and significance tests appear in the experimental sections of the full manuscript. This follows common practice for abstracts. If the editor prefers, we can insert a small number of key headline metrics into the abstract during revision. revision: partial
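
For readers unfamiliar with how significance is commonly tested in MT evaluation, one widely used option is paired bootstrap resampling. The sketch below is a simplified version that resamples per-sentence scores and compares their sums; a corpus-level metric such as BLEU would instead need to be recomputed on every resample. The function name and the toy scores are illustrative assumptions, and nothing here reflects which test the authors actually ran.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resamples in which system A's total score beats system B's.

    scores_a and scores_b are per-sentence scores for the same test sentences.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw the same sentence indices for both systems (the "paired" part).
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples

# Invented per-sentence scores, used only to exercise the function.
toy_a = [0.5, 0.7, 0.6, 0.8]
toy_b = [0.4, 0.6, 0.5, 0.7]
win_rate = paired_bootstrap(toy_a, toy_b)
```

A win rate close to 1 across resamples is usually read as evidence that A's advantage is not an artifact of the particular test set.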

Circularity Check

0 steps flagged

No circularity: Mix-MoE is an empirical architecture proposal with no derivation chain

The paper proposes a two-stage post-pretraining process, a fixed split of MoE layers into LM Experts (monolingual data) and MT Experts (parallel data), and a Fourier-feature router. These are explicit design choices presented as solutions to parameter interference. No equations, fitted parameters, or first-principles derivations are described that reduce to self-defined quantities or self-citations. Claims rest on experimental comparisons to baselines, which are external evaluations rather than internal reductions. The method is therefore self-contained as an engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the untested premise that the proposed expert split and routing will separate knowledge types without introducing new interference; the abstract quantifies none of the ledger entries below.

axioms (1)
  • domain assumption: Specialized LM and MT expert groups can be trained sequentially on monolingual then parallel data without destructive interference.
    This premise is invoked to justify the two-stage design and the division of MoE layers.
invented entities (1)
  • LM Experts and MT Experts (no independent evidence)
    purpose: To separately store monolingual knowledge and bilingual translation knowledge inside the same MoE layers.
    These are introduced as distinct groups of experts within the MoE layers.
