Mix-MoE: Improving Multilingual Machine Translation of Large Language Models through Mixed MoEs

Bo Li; Deyi Xiong; Shaolin Zhu; Tianyu Dong

arxiv: 2605.24681 · v1 · pith:IMPSEHPVnew · submitted 2026-05-23 · 💻 cs.CL · cs.AI


Pith reviewed 2026-06-30 13:20 UTC · model grok-4.3


keywords mixture of experts · multilingual machine translation · large language models · parameter interference · fourier routing · post-pretraining


The pith

Mix-MoE splits LLM experts into monolingual and translation groups to reduce parameter interference in multilingual MT.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a new way to fine-tune large language models for translating between many languages without the usual loss of original language skills. It uses a mixed Mixture-of-Experts setup in two stages, first reinforcing monolingual knowledge then adding translation knowledge through separate expert groups. A routing system based on Fourier transforms helps the experts coordinate. If successful, this means models can learn translations more effectively while keeping their general language understanding intact. The approach directly targets the problem of parameters conflicting when training on both single-language and translation data.

Core claim

Mix-MoE divides the MoE layers into LM Experts trained solely on monolingual corpora to preserve pre-trained knowledge and MT Experts trained on parallel corpora to learn translation abilities, using a Fourier Transform-enhanced routing mechanism for better expert interaction, leading to improved multilingual MT performance and reduced parameter interference.

What carries the argument

The split of MoE layers into Language Model Experts and Machine Translation Experts combined with Fourier Transform routing.
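The routing idea can be illustrated with a minimal sketch. The summary does not give the exact form of the Fourier features or how they enter the gate, so everything below is an assumption made for illustration: the features are taken to be DFT magnitudes of the token's hidden vector, they are added to a standard linear gate, and the names `dft_magnitudes`, `route`, `w_hidden`, `w_fourier`, and `GROUPS` are hypothetical.

```python
import cmath
import math

def dft_magnitudes(x, n_freqs):
    """Magnitudes of the first n_freqs DFT coefficients of a real vector x."""
    n = len(x)
    mags = []
    for k in range(n_freqs):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(coeff) / n)
    return mags

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(hidden, w_hidden, w_fourier, top_k=2):
    """Return [(expert_index, weight)] for the top_k experts.

    hidden:    token representation (list of floats)
    w_hidden:  one weight row per expert over the raw hidden state
    w_fourier: one weight row per expert over the Fourier magnitudes
    Assumption: gate logit = linear(hidden) + linear(Fourier magnitudes).
    """
    feats = dft_magnitudes(hidden, len(w_fourier[0]))
    logits = [
        sum(w * h for w, h in zip(wh, hidden)) + sum(w * f for w, f in zip(wf, feats))
        for wh, wf in zip(w_hidden, w_fourier)
    ]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    # Renormalize so the selected experts' weights sum to 1.
    return [(i, probs[i] / norm) for i in top]

# Hypothetical layout: experts 0-1 are LM Experts, experts 2-3 are MT Experts.
GROUPS = {0: "LM", 1: "LM", 2: "MT", 3: "MT"}
```

Note that in this sketch a single router scores both groups, so nothing stops a token from being sent to experts of either group. Whether the actual method constrains that is not stated in the summary.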

If this is right

Multilingual machine translation performance will increase beyond standard fine-tuning methods.
Parameter interference between monolingual and bilingual training will decrease.
Models will retain more of their original monolingual capabilities after translation training.
The Fourier routing will improve how experts are selected based on text structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could extend to other fine-tuning scenarios where general and specialized knowledge need separation.
Fixed expert groups might be replaced with learned groupings in future versions for more flexibility.
Similar expert splitting could apply to tasks like multilingual question answering or summarization.

Load-bearing premise

That dividing experts into fixed monolingual and translation groups with Fourier routing will prevent interference without causing new problems or wasting model capacity.

What would settle it

Finding that a Mix-MoE model shows the same degradation in monolingual task performance after parallel data training as a standard fine-tuned model would indicate that the claim does not hold.
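This check can be written down as a small comparison. The sketch below assumes a single monolingual benchmark score measured before and after parallel-data training for each model. The function names, the `tolerance` parameter, and any numbers passed in are placeholders, not results from the paper.

```python
def degradation(before, after):
    """Drop in a monolingual benchmark score after parallel-data training."""
    return before - after

def claim_survives(baseline, mix_moe, tolerance=0.0):
    """True if Mix-MoE degrades less than the baseline by more than `tolerance`.

    baseline, mix_moe: (score_before, score_after) tuples on the same benchmark.
    """
    return degradation(*mix_moe) + tolerance < degradation(*baseline)
```

In practice the comparison would need several benchmarks and a significance test. A single pair of scores is only the simplest form of the argument.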

Figures

Figures reproduced from arXiv: 2605.24681 by Bo Li, Deyi Xiong, Shaolin Zhu, Tianyu Dong.

**Figure 1.** The architecture overview of Mix-MoE. We transform a dense LLM into a MoE model that includes two groups of experts: Language Model Experts

**Figure 2.** Selection Proportion for each LM and MT Expert across all 14 translation tasks. From left to right, they are Experts 0 to Experts 3.

**Figure 3.** Effect of Expert Number on BLEU for EN-ZH and ZH-EN Translation

**Figure 4.** BLEU improvement achieved on other language pairs using the Mix

**Figure 5.** BLEU improvement achieved on other language pairs using the Mix

**Figure 6.** Comparison of BLEU scores across 14 language directions for

Original abstract

Large Language Models (LLMs) have shown great promise in multilingual machine translation (MT), even with limited bilingual supervision. However, fine-tuning LLMs with parallel corpora presents major challenges, namely parameter interference. To address these issues, we propose Mix-MoE, a mixed Mixture-of-Experts framework designed to train LLMs for multilingual MT. Our framework operates in two distinct stages: (1) post-pretraining with MoE on monolingual corpora, and (2) post-pretraining with MoE on parallel corpora. Crucially, we divide the MoE layers into two specialized groups: Language Model Experts (LM Experts) and Machine Translation Experts (MT Experts). LM Experts are designed to capture and retain the monolingual knowledge learned by the pre-trained LLM. MT Experts, on the other hand, are specifically trained to acquire and store bilingual translation knowledge. Furthermore, to facilitate effective interaction between these specialized experts and leverage potential underlying structural patterns in text, we introduce a routing mechanism enhanced by Fourier Transform features derived from model representations. The experimental results demonstrate that Mix-MoE excels in multilingual MT, significantly outperforming existing baselines and showing notable progress in mitigating parameter interference.
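The two-stage schedule in the abstract can be sketched as a per-stage switch on which expert group receives updates. This is a minimal illustration, not the authors' training code. It assumes that only the LM Experts are updated in stage 1 and only the MT Experts in stage 2, and it does not model the router or the rest of the network, whose treatment the abstract does not describe. `Expert`, `STAGES`, and `configure_stage` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    group: str              # "LM" or "MT"
    trainable: bool = False

# Stage -> (corpus type, expert group that receives updates).
STAGES = {
    1: ("monolingual", "LM"),
    2: ("parallel", "MT"),
}

def configure_stage(experts, stage):
    """Mark only the expert group assigned to `stage` as trainable.

    Returns the corpus type the stage is expected to train on.
    """
    corpus, active_group = STAGES[stage]
    for e in experts:
        e.trainable = (e.group == active_group)
    return corpus

experts = [Expert("lm_0", "LM"), Expert("lm_1", "LM"),
           Expert("mt_0", "MT"), Expert("mt_1", "MT")]

for stage in (1, 2):
    corpus = configure_stage(experts, stage)
    active = [e.name for e in experts if e.trainable]
    print(f"stage {stage}: corpus={corpus}, trainable={active}")
```

Freezing the inactive group is one straightforward way to keep stage 2 from overwriting what stage 1 stored. A real implementation would apply the same switch to parameter gradients in whatever training framework is used.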

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Mix-MoE adds a fixed LM/MT expert split and Fourier routing on top of two-stage post-pretraining, but the abstract supplies no results to show the split actually reduces interference.


The new piece is the explicit division of MoE layers into LM Experts updated only on monolingual data and MT Experts updated only on parallel data, run in two separate post-pretraining stages, with Fourier features added to the router. That construction is a direct attempt to keep monolingual knowledge from being overwritten during translation fine-tuning.

The approach targets a known pain point in multilingual LLM adaptation. Separating the expert pools by data type is a clean way to try to limit cross-task interference without changing the base model architecture much.

The problem is that the abstract states outperformance and interference reduction but gives no numbers, no baselines, no training details, and no routing statistics. Without those, there is no way to check whether the router actually sends tokens to the intended expert group or whether one group ends up idle. The stress-test concern holds on the supplied information: a fixed split does not by itself guarantee isolation if the router can still mix the groups at inference.

The paper is aimed at researchers already working on MoE routing for machine translation. A reader who follows that literature might pick up the two-stage split idea, but the lack of evidence makes it hard to judge whether the Fourier addition or the expert division is doing the work.

I would send it to peer review. The framing is coherent and the interference problem is real; the experiments, if they exist in the full manuscript, are what will decide whether the method is worth adopting.

Referee Report

2 major / 2 minor

Summary. The paper proposes Mix-MoE, a two-stage post-pretraining Mixture-of-Experts framework for multilingual machine translation in LLMs. Stage 1 applies MoE to monolingual corpora with LM Experts; stage 2 applies MoE to parallel corpora with MT Experts. The MoE layers are partitioned into these two fixed groups, and a Fourier Transform feature-enhanced router is introduced to enable interaction while mitigating parameter interference. The central claim is that this design yields significant gains over baselines in multilingual MT and reduces interference.

Significance. If the empirical claims hold with proper controls and ablations, the work would offer a concrete architectural approach to separating monolingual and bilingual knowledge in LLM fine-tuning for MT, which could be useful for scaling multilingual capabilities without proportional increases in interference.

major comments (2)

[Abstract / framework] Abstract and framework description: the central claim that the fixed LM/MT expert split plus Fourier routing reduces parameter interference rests on the unverified assumption that the router will not route monolingual tokens to MT Experts (or vice versa) at inference; no analysis, routing statistics, or ablation is supplied to show that cross-group routing is prevented or that capacity does not collapse.
[Abstract] Abstract: the assertion of 'significantly outperforming existing baselines' and 'notable progress in mitigating parameter interference' is stated without any reported metrics, baselines, language pairs, model sizes, or statistical significance; the soundness of the headline result cannot be assessed from the supplied evidence.
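The routing analysis requested in the first major comment could be computed along the following lines. This is a sketch under stated assumptions: routing decisions are logged as `(data_type, expert_index)` pairs, the labels `"mono"` and `"parallel"` are hypothetical, and the expert-to-group mapping is known. None of these names come from the paper.

```python
from collections import Counter

# Hypothetical mapping of which expert group each data type is meant to use.
INTENDED_GROUP = {"mono": "LM", "parallel": "MT"}

def group_routing_stats(routing_log, expert_group):
    """Summarize where tokens were routed.

    routing_log:  iterable of (data_type, expert_index) pairs.
    expert_group: dict mapping expert_index -> "LM" or "MT".
    Returns {data_type: {group: fraction of that data type's routed tokens}}.
    """
    counts = {}
    for data_type, expert in routing_log:
        counts.setdefault(data_type, Counter())[expert_group[expert]] += 1
    return {
        data_type: {g: c / sum(ctr.values()) for g, c in ctr.items()}
        for data_type, ctr in counts.items()
    }

def cross_group_fraction(stats, intended=INTENDED_GROUP):
    """Fraction of each data type's tokens sent outside its intended group."""
    return {
        data_type: 1.0 - fractions.get(intended[data_type], 0.0)
        for data_type, fractions in stats.items()
    }
```

A cross-group fraction near zero would support the isolation claim. A large fraction, or a group that receives almost no tokens, would point to the mixing or capacity-collapse concerns raised above.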

minor comments (2)

[Method] The precise mathematical form of the Fourier features added to the router and how they are combined with the standard gating function should be stated explicitly (e.g., as an equation).
[Method] Clarify whether the two expert groups share any parameters or whether the router is trained jointly or in separate phases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. Below we respond point by point to the major comments and indicate planned revisions.


Referee: [Abstract / framework] Abstract and framework description: the central claim that the fixed LM/MT expert split plus Fourier routing reduces parameter interference rests on the unverified assumption that the router will not route monolingual tokens to MT Experts (or vice versa) at inference; no analysis, routing statistics, or ablation is supplied to show that cross-group routing is prevented or that capacity does not collapse.

Authors: We agree that the current manuscript lacks explicit routing statistics, cross-group routing analysis, or dedicated ablations to verify that the router largely respects the LM/MT expert partition at inference. The architectural intent is that the fixed grouping plus Fourier features limits harmful interference while still permitting interaction, but this remains an assumption without direct empirical support in the submitted version. We will add routing distribution statistics across expert groups and an ablation isolating cross-group routing in the revised manuscript. revision: yes
Referee: [Abstract] Abstract: the assertion of 'significantly outperforming existing baselines' and 'notable progress in mitigating parameter interference' is stated without any reported metrics, baselines, language pairs, model sizes, or statistical significance; the soundness of the headline result cannot be assessed from the supplied evidence.

Authors: The abstract is written as a high-level summary; all quantitative results, baselines, language pairs, model sizes, and significance tests appear in the experimental sections of the full manuscript. This follows common practice for abstracts. If the editor prefers, we can insert a small number of key headline metrics into the abstract during revision. revision: partial

Circularity Check

0 steps flagged

No circularity: Mix-MoE is an empirical architecture proposal with no derivation chain


The paper proposes a two-stage post-pretraining process, a fixed split of MoE layers into LM Experts (monolingual data) and MT Experts (parallel data), and a Fourier-feature router. These are explicit design choices presented as solutions to parameter interference. No equations, fitted parameters, or first-principles derivations are described that reduce to self-defined quantities or self-citations. Claims rest on experimental comparisons to baselines, which are external evaluations rather than internal reductions. The method is therefore self-contained as an engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the untested premise that the proposed expert split and routing will separate knowledge types without new interference; no free parameters, axioms, or invented entities are quantified in the abstract.

axioms (1)

Domain assumption: Specialized LM and MT expert groups can be trained sequentially on monolingual then parallel data without destructive interference.
This premise is invoked to justify the two-stage design and the division of MoE layers.

invented entities (1)

LM Experts and MT Experts (no independent evidence)
Purpose: to separately store monolingual knowledge and bilingual translation knowledge inside the same MoE layers.
These are introduced as distinct groups of experts within the MoE layers.



Reference graph

Works this paper leans on

59 extracted references · 15 canonical work pages · 6 internal anchors

[1]

Multilingual mix: Example interpolation improves multilingual neural machine translation,

Y . Cheng, A. Bapna, O. Firat, Y . Cao, P. Wang, and W. Macherey, “Multilingual mix: Example interpolation improves multilingual neural machine translation,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicenc...

2022
[2]

Towards higher pareto frontier in multilingual machine translation,

Y . Huang, X. Feng, X. Geng, B. Li, and B. Qin, “Towards higher pareto frontier in multilingual machine translation,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, A. Rogers, J. L. Boyd-Graber, and N. Okazaki, Eds. Association for Computational...

2023
[3]

Neural machine translation by jointly learning to align and translate,

D. Bahdanau, K. Cho, and Y . Bengio, “Neural machine translation by jointly learning to align and translate,” in3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7- 9, 2015, Conference Track Proceedings, Y . Bengio and Y . LeCun, Eds., 2015

2015
[4]

New trends in machine translation using large language models: Case examples with chatgpt,

C. Lyu, J. Xu, and L. Wang, “New trends in machine translation using large language models: Case examples with chatgpt,”CoRR, vol. abs/2305.01181, 2023

work page arXiv 2023
[5]

Multilingual machine translation with large language models: Empirical results and analysis,

W. Zhu, H. Liu, Q. Dong, J. Xu, S. Huang, L. Kong, J. Chen, and L. Li, “Multilingual machine translation with large language models: Empirical results and analysis,” inFindings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, June 16-21, 2024, K. Duh, H. G ´omez-Adorno, and S. Bethard, Eds. Association for Computational L...

2024
[6]

Revolutionising translation with ai: Unravelling neural machine translation and generative pre-trained large language models,

S. C. Siu, “Revolutionising translation with ai: Unravelling neural machine translation and generative pre-trained large language models,” inNew Advances in Translation Technology: Applications and Pedagogy. Springer, 2024, pp. 29–54

2024
[7]

Continual learning with semi-supervised contrastive distillation for incremental neural machine translation,

Y . Liang, F. Meng, J. Wang, J. Xu, Y . Chen, and J. Zhou, “Continual learning with semi-supervised contrastive distillation for incremental neural machine translation,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, L. Ku, A. Martins, and ...

2024
[8]

An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning

Y . Luo, Z. Yang, F. Meng, Y . Li, J. Zhou, and Y . Zhang, “An empir- ical study of catastrophic forgetting in large language models during continual fine-tuning,”CoRR, vol. abs/2308.08747, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Simple and scalable strategies to continually pre-train large language models,

A. Ibrahim, B. Th ´erien, K. Gupta, M. L. Richter, Q. G. Anthony, E. Belilovsky, T. Lesort, and I. Rish, “Simple and scalable strategies to continually pre-train large language models,”Trans. Mach. Learn. Res., vol. 2024, 2024

2024
[10]

Breaking the script barrier in multilingual pre-trained language models with transliteration-based post- training alignment,

O. Xhelili, Y . Liu, and H. Sch ¨utze, “Breaking the script barrier in multilingual pre-trained language models with transliteration-based post- training alignment,” inFindings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12-16, 2024, Y . Al-Onaizan, M. Bansal, and Y . Chen, Eds. Association for Computational...

2024
[11]

Towards incremental learning in large language models: A critical review,

M. Jovanovic and P. V oss, “Towards incremental learning in large language models: A critical review,”CoRR, vol. abs/2404.18311, 2024

work page arXiv 2024
[12]

Moe-lpr: Multilingual extension of large language models through mixture-of-experts with language priors routing,

H. Zhou, Z. Wang, S. Huang, X. Huang, X. Han, J. Feng, C. Deng, W. Luo, and J. Chen, “Moe-lpr: Multilingual extension of large language models through mixture-of-experts with language priors routing,”CoRR, vol. abs/2408.11396, 2024

work page arXiv 2024
[13]

Overcoming language barriers via machine translation with sparse mixture-of-experts fusion of large language models,

S. Zhu, L. Pan, D. Jian, and D. Xiong, “Overcoming language barriers via machine translation with sparse mixture-of-experts fusion of large language models,”Information Processing & Management, vol. 62, no. 3, p. 104078, 2025

2025
[14]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” inAdvances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S...

2017
[15]

Transfer learning for low- resource neural machine translation,

B. Zoph, D. Yuret, J. May, and K. Knight, “Transfer learning for low- resource neural machine translation,” inProceedings of the 2016 Con- ference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, J. Su, X. Carreras, and K. Duh, Eds. The Association for Computational Linguistics, 2016, pp. 1568–1575

2016
[16]

Rapid adaptation of neural machine translation to new languages,

G. Neubig and J. Hu, “Rapid adaptation of neural machine translation to new languages,” inProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, Eds. Association for Computational Linguistics, 2018, pp. 875–880

2018
[17]

Improving neural machine translation models with monolingual data,

R. Sennrich, B. Haddow, and A. Birch, “Improving neural machine translation models with monolingual data,” inProceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics, 2016

2016
[18]

Zero-shot cross-lingual transfer of neural machine translation with multilingual pretrained encoders,

G. Chen, S. Ma, Y . Chen, L. Dong, D. Zhang, J. Pan, W. Wang, and F. Wei, “Zero-shot cross-lingual transfer of neural machine translation with multilingual pretrained encoders,” inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens...

2021
[19]

Towards robust in-context learning for machine translation with large language models,

S. Zhu, M. Cui, and D. Xiong, “Towards robust in-context learning for machine translation with large language models,” inProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy, N. Calzolari, M. Kan, V . Hoste, A. Lenci, S. Sakti, and N. Xue, E...

2024
[20]

On the multilingual ability of decoder-based pre-trained language mod- els: Finding and controlling language-specific neurons,

T. Kojima, I. Okimura, Y . Iwasawa, H. Yanaka, and Y . Matsuo, “On the multilingual ability of decoder-based pre-trained language mod- els: Finding and controlling language-specific neurons,”CoRR, vol. abs/2404.02431, 2024

work page arXiv 2024
[21]

A paradigm shift: The future of machine translation lies with large language models,

C. Lyu, Z. Du, J. Xu, Y . Duan, M. Wu, T. Lynn, A. F. Aji, D. F. Wong, and L. Wang, “A paradigm shift: The future of machine translation lies with large language models,” inProceedings of the 2024 Joint Interna- tional Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy, N. Calzolari...

2024
[22]

Improving translation of out of vocabulary words using bilingual lexicon induction in low-resource machine translation,

J. Waldendorf, A. Birch, B. Hadow, and A. V . M. Barone, “Improving translation of out of vocabulary words using bilingual lexicon induction in low-resource machine translation,” inProceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), AMTA 2022, Orlando, USA, September 12- 16, 20...

2022
[23]

Exploiting domain-specific par- allel data on multilingual language models for low-resource language translation,

S. Ranathunga, S. Nayak, S. C. Huang, Y . Mao, T. Su, Y . R. Chan, S. Yuan, A. Rinaldi, and E. A. Lee, “Exploiting domain-specific par- allel data on multilingual language models for low-resource language translation,”CoRR, vol. abs/2412.19522, 2024

work page arXiv 2024
[24]

Catastrophic interference in reinforcement learning: A solution based on context division and knowledge distillation,

T. Zhang, X. Wang, B. Liang, and B. Yuan, “Catastrophic interference in reinforcement learning: A solution based on context division and knowledge distillation,”IEEE Trans. Neural Networks Learn. Syst., vol. 34, no. 12, pp. 9925–9939, 2023

2023
[25]

Understanding catas- trophic forgetting in language models via implicit inference,

S. Kotha, J. M. Springer, and A. Raghunathan, “Understanding catas- trophic forgetting in language models via implicit inference,” inThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

2024
[26]

Efficient continual pre-training for building domain specific large language models,

Y . Xie, K. Aggarwal, and A. Ahmad, “Efficient continual pre-training for building domain specific large language models,” inFindings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, L. Ku, A. Martins, and V . Srikumar, Eds. Association for Computational Linguistics, 2024, pp. 10 184–10 201

2024
[27]

Overcoming catastrophic forgetting in graph neural networks,

H. Liu, Y . Yang, and X. Wang, “Overcoming catastrophic forgetting in graph neural networks,” inThirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Appli- cations of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, ...

2021
[28]

A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA

D. Kalajdzievski, “A rank stabilization scaling factor for fine-tuning with lora,”CoRR, vol. abs/2312.03732, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

arXiv preprint arXiv:2112.10684 , year=

M. Artetxe, S. Bhosale, N. Goyal, T. Mihaylov, M. Ott, S. Shleifer, X. V . Lin, J. Du, S. Iyer, R. Pasunuruet al., “Efficient large scale language modeling with mixtures of experts,”arXiv preprint arXiv:2112.10684, 2021

work page arXiv 2021
[30]

Distributed learning of mixtures of experts,

F. Chamroukhi and N. T. Pham, “Distributed learning of mixtures of experts,”CoRR, vol. abs/2312.09877, 2023

work page arXiv 2023
[31]

Mmoe: Enhancing multimodal models with mixtures of multimodal interaction experts,

H. Yu, Z. Qi, L. Jang, R. Salakhutdinov, L. Morency, and P. P. Liang, “Mmoe: Enhancing multimodal models with mixtures of multimodal interaction experts,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, Y . Al-Onaizan, M. Bansal, and Y . Chen, Eds. Association for ...

2024
[32]

From sparse to soft mixtures of experts,

J. Puigcerver, C. R. Ruiz, B. Mustafa, and N. Houlsby, “From sparse to soft mixtures of experts,” inThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

2024
[33]

Efficient large scale language modeling with mixtures of experts,

M. Artetxe, S. Bhosale, N. Goyal, T. Mihaylov, M. Ott, S. Shleifer, X. V . Lin, J. Du, S. Iyer, R. Pasunuru, G. Anantharaman, X. Li, S. Chen, H. Akin, M. Baines, L. Martin, X. Zhou, P. S. Koura, B. O’Horo, J. Wang, L. Zettlemoyer, M. T. Diab, Z. Kozareva, and V . Stoyanov, “Efficient large scale language modeling with mixtures of experts,” in Proceedings ...

2022
[34]

A paradigm shift in machine translation: Boosting translation performance of large language models,

H. Xu, Y . J. Kim, A. Sharaf, and H. H. Awadalla, “A paradigm shift in machine translation: Boosting translation performance of large language models,” inThe Twelfth International Conference on Learning Represen- tations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

2024
[35]

X-ALMA: plug & play modules and adaptive rejection for quality translation at scale,

H. Xu, K. Murray, P. Koehn, H. Hoang, A. Eriguchi, and H. Khayrallah, “X-ALMA: plug & play modules and adaptive rejection for quality translation at scale,” inThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

2025
[36]

Digital signal processing: signals systems and filters,

A. Antoniu, “Digital signal processing: signals systems and filters,” 2006

2006
[37]

On the relation between linguistic typology and (limitations of) multilin- gual language modeling,

D. Gerz, I. Vulic, E. M. Ponti, R. Reichart, and A. Korhonen, “On the relation between linguistic typology and (limitations of) multilin- gual language modeling,” inProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, E...

2018
[38]

Frequency of basic english grammatical structures: A corpus analysis,

D. Roland, F. Dick, and J. L. Elman, “Frequency of basic english grammatical structures: A corpus analysis,”Journal of memory and language, vol. 57, no. 3, pp. 348–379, 2007

2007
[39]

Deep Learning Based Natural Language Processing for End to End Speech Translation

S. Patil, “Deep learning based natural language processing for end to end speech translation,”arXiv preprint arXiv:1808.04459, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[40]

Dnn-based cross-lingual voice conversion using bottleneck features,

M. Kiran Reddy and K. Sreenivasa Rao, “Dnn-based cross-lingual voice conversion using bottleneck features,”Neural Processing Letters, vol. 51, no. 2, pp. 2029–2042, 2020

2029
[41]

Learn- ing a Fourier transform for linear relative positional encodings in transformers,

K. Choromanski, S. Li, V . Likhosherstov, K. Avinava Dubey, S. Luo, D. He, Y . Yang, T. Sarlos, T. Weingarten, and A. Weller, “Learn- ing a Fourier transform for linear relative positional encodings in transformers,” inProceedings of The 27th International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research,...

2024
[42]

Learnable fourier features for multi-dimensional spatial positional encoding,

Y . Li, S. Si, G. Li, C.-J. Hsieh, and S. Bengio, “Learnable fourier features for multi-dimensional spatial positional encoding,” inAdvances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y . Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 15 816–15 829

2021
[43]

Fourier position embedding: Enhancing attention’s periodic extension for length generalization,

E. Hua, C. Jiang, X. Lv, K. Zhang, N. Ding, Y . Sun, B. Qi, Y . Fan, X. K. Zhu, and B. Zhou, “Fourier position embedding: Enhancing attention’s periodic extension for length generalization,”arXiv preprint arXiv:2412.17739, 2024

work page arXiv 2024
[44]

Transformer feed-forward layers are key-value memories,

M. Geva, R. Schuster, J. Berant, and O. Levy, “Transformer feed-forward layers are key-value memories,” inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational L...

2021
[45]

Knowledge neurons in pretrained transformers,

D. Dai, L. Dong, Y . Hao, Z. Sui, B. Chang, and F. Wei, “Knowledge neurons in pretrained transformers,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds. Association for Computational Linguistics, 20...

2022
[46]

A mathematical framework for transformer circuits,

N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y . Bai, A. Chen, T. Conerlyet al., “A mathematical framework for transformer circuits,”Transformer Circuits Thread, vol. 1, no. 1, p. 12, 2021

2021
[47]

Mixtral of Experts

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bam- ford, D. S. Chaplot, D. de Las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mixtral of experts,” CoRR, vol. ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[48]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,

W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,”J. Mach. Learn. Res., vol. 23, pp. 120:1–120:39, 2022

2022
[49]

Mixtral of experts,

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bam- ford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mixtral of experts,” 2024

2024
[50]

Llama-moe v2: Exploring sparsity of llama from perspective of mixture-of-experts with post-training,

X. Qu, D. Dong, X. Hu, T. Zhu, W. Sun, and Y. Cheng, “Llama-moe v2: Exploring sparsity of llama from perspective of mixture-of-experts with post-training,” CoRR, vol. abs/2411.15708, 2024

2024
[51]

Llama-moe: Building mixture-of-experts from llama with continual pre-training,

T. Zhu, X. Qu, D. Dong, J. Ruan, J. Tong, C. He, and Y. Cheng, “Llama-moe: Building mixture-of-experts from llama with continual pre-training,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 15 913–15 923

2024
[52]

OLMoE: Open Mixture-of-Experts Language Models

N. Muennighoff, L. Soldaini, D. Groeneveld, K. Lo, J. Morrison, S. Min, W. Shi, P. Walsh, O. Tafjord, N. Lambert et al., “Olmoe: Open mixture-of-experts language models,” arXiv preprint arXiv:2409.02060, 2024

2024
[53]

A call for clarity in reporting BLEU scores,

M. Post, “A call for clarity in reporting BLEU scores,” in Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, O. Bojar, R. Chatterjee, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, C. Monz, M. Negri, A. Névéol, M. L. Neves, M. Post...

2018
[54]

METEOR: an automatic metric for MT evaluation with improved correlation with human judgments,

S. Banerjee and A. Lavie, “METEOR: an automatic metric for MT evaluation with improved correlation with human judgments,” in Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005, J. Goldstein, A. Lavie, C. Lin, and C. R. Voss, Eds. Associ...

2005
[55]

COMET: A neural framework for MT evaluation,

R. Rei, C. Stewart, A. C. Farinha, and A. Lavie, “COMET: A neural framework for MT evaluation,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), B. Webber, T. Cohn, Y. He, and Y. Liu, Eds. Online: Association for Computational Linguistics, Nov. 2020, pp. 2685–2702

2020
[56]

Qa-lora: Quantization-aware low-rank adaptation of large language models,

Y. Xu, L. Xie, X. Gu, X. Chen, H. Chang, H. Zhang, Z. Chen, X. Zhang, and Q. Tian, “Qa-lora: Quantization-aware low-rank adaptation of large language models,” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

2024
[57]

Llama pro: Progressive llama with block expansion,

C. Wu, Y. Gan, Y. Ge, Z. Lu, J. Wang, Y. Feng, Y. Shan, and P. Luo, “Llama pro: Progressive llama with block expansion,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, L. Ku, A. Martins, and V. Srikumar, Eds. Association for Computat...

2024
[58]

Flashattention: Fast and memory-efficient exact attention with io-awareness,

T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” in Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, S. Koyejo, S. Mohamed, A. Agarwal, D....

2022
[59]

No Language Left Behind: Scaling Human-Centered Machine Translation

M. R. Costa-Jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard et al., “No language left behind: Scaling human-centered machine translation,” arXiv preprint arXiv:2207.04672, 2022

2022
