Mixture of Experts for Low-Resource LLMs

Dan Revital; Noam Kayzer; Ori Bar Joseph; Sarel Weinberger; Smadar Arvatz

arxiv: 2605.17598 · v1 · pith:TW4ZVCRFnew · submitted 2026-05-17 · 💻 cs.CL

Pith reviewed 2026-05-20 12:40 UTC · model grok-4.3

keywords: Mixture of Experts, routing dynamics, low-resource languages, continual pre-training, multilingual LLMs, expert specialization, routing entropy

The pith

Low-resource languages trigger deep-layer routing collapse in Mixture-of-Experts models, which balanced continual pre-training largely reverses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tracks how two different MoE architectures route tokens when processing Hebrew, a morphologically rich low-resource language. In both a pure Transformer and a hybrid Mamba-Transformer model, pre-training produces sharp entropy collapse in the final layers, with most tokens funneled to a small expert subset, a pattern absent in English. Continual pre-training on balanced bilingual data raises entropy and moves routing toward shared, language-agnostic experts, while supervised fine-tuning alone leaves more imbalance. The same collapse signatures appear in Japanese, and the routing changes track measurable gains on downstream benchmarks.

Core claim

Both pre-trained models exhibit deep-layer routing collapse: usage entropy drops sharply in final layers and tokens concentrate on a narrow expert subset, a pattern largely absent for English. Continual pre-training on balanced bilingual data substantially corrects this imbalance, increasing entropy and shifting routing toward shared, language-agnostic experts; supervised fine-tuning alone achieves less complete correction. Extending the analysis to Japanese reveals quantitatively consistent collapse signatures, providing cross-linguistic evidence that the phenomenon is a systematic consequence of pre-training underrepresentation rather than any language-intrinsic property.

What carries the argument

Deep-layer routing collapse, the sharp drop in expert-usage entropy and concentration of tokens onto few experts in the final layers of pre-trained MoE models.
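
As a rough illustration of the quantity involved, the sketch below computes a normalized expert-usage entropy for a single layer from a list of routed expert indices. The function name `usage_entropy`, the normalization by the log of the expert count, and the toy inputs are assumptions made for illustration; the paper's exact definition is not reproduced on this page.

```python
import math
from collections import Counter

def usage_entropy(expert_ids, num_experts):
    """Normalized Shannon entropy of expert usage in one layer.

    expert_ids: iterable of expert indices chosen by the router, one entry
    per token-to-expert assignment. Returns a value in [0, 1]: 1 means
    perfectly uniform usage, 0 means every assignment hit one expert.
    """
    counts = Counter(expert_ids)
    total = sum(counts.values())
    if total == 0 or num_experts < 2:
        return 0.0
    entropy = 0.0
    for c in counts.values():
        p = c / total
        entropy -= p * math.log(p)
    # Dividing by log(num_experts) makes layers with different expert
    # counts comparable (an assumed convention, not taken from the paper).
    return entropy / math.log(num_experts)

# Toy data (hypothetical): 8 experts, balanced versus collapsed routing.
balanced = [i % 8 for i in range(800)]
collapsed = [0] * 700 + [1] * 100
```

Under this convention the balanced toy layer scores close to 1 and the collapsed one scores far lower, which is the kind of contrast the collapse claim describes.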

If this is right

Routing entropy and expert specialization act as practical diagnostics for multilingual capacity in MoE systems.
Balanced bilingual continual pre-training restores more balanced expert usage than supervised fine-tuning alone.
The collapse pattern is not unique to Hebrew but appears in other underrepresented languages such as Japanese.
Improved routing toward language-agnostic experts correlates with gains on standard multilingual benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Initial pre-training data balance may be more important than later corrective stages for preventing routing bias.
Similar collapse could appear in any low-resource language pair, suggesting a general test for MoE multilingual readiness.
If routing entropy proves causal, training objectives that directly reward high-entropy routing could supplement standard loss functions.
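
To make the last extension concrete, here is a minimal sketch of what an entropy-rewarding auxiliary term might look like. It is hypothetical throughout: the names `mean_router_probs` and `entropy_bonus_loss`, the `weight` value, and the choice to penalize the gap between the maximum entropy and the entropy of batch-averaged router probabilities are assumptions, not something the paper proposes. Plain Python is used only to show the arithmetic; a real training run would need the same expression in an autodiff framework.

```python
import math

def mean_router_probs(router_probs):
    """Average per-token router distributions into one usage distribution."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    return [sum(tok[e] for tok in router_probs) / num_tokens
            for e in range(num_experts)]

def entropy_bonus_loss(router_probs, weight=0.01):
    """Hypothetical auxiliary term: weight * (log E - H(mean usage)).

    Zero when batch-averaged routing is uniform over the E experts and
    larger as routing concentrates on fewer experts.
    """
    mean_p = mean_router_probs(router_probs)
    entropy = -sum(p * math.log(p) for p in mean_p if p > 0.0)
    return weight * (math.log(len(mean_p)) - entropy)

# Hypothetical batches of per-token router probabilities over 4 experts.
uniform_batch = [[0.25, 0.25, 0.25, 0.25] for _ in range(10)]
collapsed_batch = [[1.0, 0.0, 0.0, 0.0] for _ in range(10)]
```

The term is added to the main loss, so minimizing it pushes average routing toward uniform usage; whether that would help downstream quality is exactly the open causal question.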

Load-bearing premise

The premise is that the observed routing collapse and its correction are a systematic consequence of pre-training underrepresentation rather than of language-intrinsic properties, and that routing changes causally drive the reported downstream benchmark gains.

What would settle it

Measure whether an MoE model pre-trained from scratch on equal English-Hebrew token counts shows the same final-layer entropy drop and narrow expert concentration as the original models.
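
One way such a comparison could be scored, sketched under assumptions: given a per-layer entropy profile for each language, compute how far the final layers fall below the earlier ones. The helper `deep_layer_gap`, the choice of how many layers count as deep, and the numbers are all hypothetical and are not values from the paper.

```python
def deep_layer_gap(entropy_by_layer, num_deep_layers):
    """Mean entropy of earlier layers minus mean entropy of the final layers.

    A large positive value indicates that entropy drops in the deep layers.
    """
    if not 0 < num_deep_layers < len(entropy_by_layer):
        raise ValueError("num_deep_layers must leave at least one earlier layer")
    early = entropy_by_layer[:-num_deep_layers]
    deep = entropy_by_layer[-num_deep_layers:]
    return sum(early) / len(early) - sum(deep) / len(deep)

# Hypothetical per-layer normalized entropies, not values from the paper.
english = [0.92, 0.91, 0.93, 0.90, 0.91, 0.89]
hebrew = [0.90, 0.89, 0.88, 0.86, 0.55, 0.40]

gap_en = deep_layer_gap(english, num_deep_layers=2)
gap_he = deep_layer_gap(hebrew, num_deep_layers=2)
```

If a model trained from scratch on equal token counts showed a Hebrew gap close to the English one, that would favor the underrepresentation explanation; a persistent gap would count against it.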

Figures

Figures reproduced from arXiv: 2605.17598 by Dan Revital, Noam Kayzer, Ori Bar Joseph, Sarel Weinberger, Smadar Arvatz.

**Figure 2.** Gini coefficient and active expert count across MoE layers for all model variants.
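
For readers unfamiliar with the two quantities named in this caption, the following is a minimal sketch of how they might be computed from per-expert assignment counts in one layer. The Gini formula is the standard one for sorted non-negative values; the `min_share` threshold that decides when an expert counts as active is an assumption, since the paper's threshold is not stated on this page.

```python
def gini(counts):
    """Gini coefficient of expert usage counts (0 = equal usage)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending-sorted values with 1-based ranks.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n

def active_experts(counts, min_share=0.01):
    """Number of experts receiving at least `min_share` of all assignments.

    The 1% default is a hypothetical cutoff chosen for illustration.
    """
    total = sum(counts)
    if total == 0:
        return 0
    return sum(1 for c in counts if c / total >= min_share)
```

With four experts, perfectly even counts give a Gini of 0, while routing everything to one expert gives 0.75, the maximum for four experts, and an active count of 1.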

**Figure 3.** Expert language specialization analysis for all model variants.

**Figure 4.** Cross-lingual activation similarity per layer for all model variants.
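
The page does not say how cross-lingual activation similarity is defined. One plausible reading, offered here only as an assumption, is the cosine similarity between the expert-usage distributions that two languages induce in the same layer. The helper names below are hypothetical.

```python
import math

def usage_distribution(counts):
    """Turn per-expert assignment counts into a probability vector."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def layer_similarity(counts_lang_a, counts_lang_b):
    """Assumed metric: similarity of two languages' expert usage in one layer."""
    return cosine_similarity(usage_distribution(counts_lang_a),
                             usage_distribution(counts_lang_b))
```

Under this reading, values near 1 mean both languages rely on the same experts in that layer, and values near 0 mean they are routed to disjoint expert sets.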

Original abstract

Mixture-of-Experts (MoE) architectures enable efficient model scaling, yet expert routing behavior across underrepresented languages remains poorly understood. We analyze routing dynamics in two architecturally distinct MoE models, a pure Transformer (Qwen3-30B-A3B) and a hybrid Mamba-Transformer (Nemotron-3-Nano-30B-A3B), using Hebrew as a morphologically rich, low-resource testbed. Both pre-trained models exhibit deep-layer routing collapse: usage entropy drops sharply in final layers and tokens concentrate on a narrow expert subset, a pattern largely absent for English. Continual pre-training (CPT) on balanced bilingual data substantially corrects this imbalance, increasing entropy and shifting routing toward shared, language-agnostic experts; supervised fine-tuning (SFT) alone achieves less complete correction. Extending the analysis to Japanese reveals quantitatively consistent collapse signatures, providing cross-linguistic evidence that the phenomenon is a systematic consequence of pre-training underrepresentation rather than any language-intrinsic property. Routing improvements correlate with consistent downstream benchmark gains, positioning routing entropy and expert specialization as principled diagnostics for multilingual capacity in MoE systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows deep-layer routing collapse for Hebrew and Japanese in two MoE models, with CPT on balanced data fixing entropy and expert spread more than SFT does.

The key observation is that both the Qwen3 Transformer MoE and the Nemotron hybrid Mamba-Transformer show a sharp drop in routing entropy in the final layers when processing Hebrew or Japanese, with tokens piling up on a small set of experts. English does not exhibit this. Continual pre-training on balanced bilingual data raises entropy again and moves routing toward more shared experts, while supervised fine-tuning alone leaves more of the collapse in place. The same pattern appears in Japanese, which supports the claim that underrepresentation during pre-training, not language-specific traits, drives the behavior.

Referee Report

2 major / 2 minor

Summary. The paper analyzes routing dynamics in two MoE models (Qwen3-30B-A3B and Nemotron-3-Nano-30B-A3B) for low-resource languages, using Hebrew as primary testbed and Japanese for cross-linguistic validation. It reports deep-layer routing collapse (sharp drop in usage entropy and token concentration on narrow expert subsets) in pre-trained models for these languages, largely absent for English; continual pre-training on balanced bilingual data increases entropy and shifts routing toward shared experts, with SFT achieving less correction; these routing changes correlate with downstream benchmark gains, positioning entropy and specialization as diagnostics for multilingual MoE capacity.

Significance. If substantiated with stronger controls, the work offers useful empirical diagnostics for expert utilization in multilingual MoE settings and shows that balanced CPT can mitigate underrepresentation effects, which may inform training strategies for low-resource languages. The cross-linguistic consistency strengthens the case that the patterns stem from data imbalance rather than language-specific traits.

major comments (2)

[Abstract and Results] Abstract and results sections: the central claim that routing improvements 'correlate with consistent downstream benchmark gains' and position entropy/specialization as 'principled diagnostics' rests on purely observational data; no ablation (e.g., freezing routing while holding data fixed or auxiliary balanced-routing losses) isolates whether routing dynamics causally drive the gains or are a byproduct of balanced data exposure. This is load-bearing for the diagnostic framing.
[Experimental Setup and Analysis] Experimental setup and analysis sections: claims of 'sharp' entropy drops and 'substantial' correction lack reported statistical tests, exact token counts per language/layer, precise layer definitions, or controls for confounding factors such as model architecture differences or total compute. This leaves the reported patterns under-supported.

minor comments (2)

[Methods] Clarify notation for 'usage entropy' and 'routing collapse' with explicit formulas or references in the methods to ensure reproducibility.
[Figures] Figure captions should explicitly state the number of tokens sampled and the exact layers examined for each language comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, proposing targeted revisions to improve rigor while preserving the observational scope of the study.

Referee: [Abstract and Results] Abstract and results sections: the central claim that routing improvements 'correlate with consistent downstream benchmark gains' and position entropy/specialization as 'principled diagnostics' rests on purely observational data; no ablation (e.g., freezing routing while holding data fixed or auxiliary balanced-routing losses) isolates whether routing dynamics causally drive the gains or are a byproduct of balanced data exposure. This is load-bearing for the diagnostic framing.

Authors: We agree the analysis is observational and shows correlation between routing changes and benchmark gains rather than establishing direct causation. The patterns are consistent across two architecturally distinct models and two languages, supporting the utility of entropy and specialization as empirical diagnostics for underrepresentation effects. In the revision we will explicitly qualify the abstract and results sections to describe these as correlational diagnostics, add a limitations paragraph noting the absence of causal ablations such as routing freezing or auxiliary losses, and frame the contribution around the identification of reproducible patterns rather than causal claims. revision: partial
Referee: [Experimental Setup and Analysis] Experimental setup and analysis sections: claims of 'sharp' entropy drops and 'substantial' correction lack reported statistical tests, exact token counts per language/layer, precise layer definitions, or controls for confounding factors such as model architecture differences or total compute. This leaves the reported patterns under-supported.

Authors: We accept that additional quantitative details and statistical support are warranted. The revised manuscript will report statistical tests (paired t-tests or Wilcoxon signed-rank tests) for entropy differences across layers and conditions, include exact token counts per language and layer in the appendix, and explicitly define deep layers (final 10 layers of each model). We will also expand the experimental setup to discuss architecture and compute controls, noting that both models were evaluated under matched inference conditions while acknowledging that full compute-matched ablations lie outside the present scope and will be listed as future work. revision: yes
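
The simulated rebuttal mentions paired t-tests or Wilcoxon signed-rank tests. The sketch below deliberately uses a different, simpler paired test, an exact sign-flip permutation test, because it needs only the standard library; it is not the test the authors name. The function name and the per-layer entropy values are hypothetical.

```python
from itertools import product

def paired_sign_flip_pvalue(before, after):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis each paired difference is equally likely to
    be positive or negative, so all 2**n sign assignments are enumerated.
    Practical only for small n, such as roughly ten deep layers.
    """
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float ties
            extreme += 1
        total += 1
    return extreme / total

# Hypothetical deep-layer entropies before and after CPT (not paper data).
base = [0.40, 0.42, 0.38, 0.45, 0.41, 0.39, 0.44, 0.43, 0.37, 0.40]
cpt = [0.71, 0.69, 0.66, 0.75, 0.70, 0.68, 0.77, 0.72, 0.65, 0.73]
p_value = paired_sign_flip_pvalue(base, cpt)
```

When every layer moves in the same direction, as in this toy input, only the all-positive and all-negative sign assignments are as extreme as the observation, so the p-value is 2 / 2**10. Layers within one model are not independent samples, so any such test should be read as descriptive rather than conclusive.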

Circularity Check

0 steps flagged

No significant circularity in empirical routing analysis

The paper's claims rest on direct observational measurements of routing entropy, expert usage patterns across layers and languages, and correlations with benchmark scores before and after continual pre-training. No equations or derivations reduce by construction to fitted inputs, self-defined quantities, or load-bearing self-citations; the analysis treats routing collapse as an empirical phenomenon diagnosed from model behavior on Hebrew and Japanese versus English, with CPT as an external intervention. This is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that routing entropy reliably indicates multilingual capacity and that observed benchmark gains stem from routing changes rather than other training effects.

axioms (2)

Domain assumption: Routing entropy and expert concentration are valid proxies for language-specific capacity in MoE models. Used to define and quantify collapse.
Domain assumption: The two chosen models are representative of pure Transformer and hybrid MoE architectures. Basis for generalizing the collapse pattern.

pith-pipeline@v0.9.0 · 5739 in / 1354 out tokens · 60170 ms · 2026-05-20T12:40:40.907639+00:00 · methodology
