VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models

arxiv: 2508.15229 · v3 · submitted 2025-08-21 · 💻 cs.CL · cs.AI· cs.LG

VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models

Hanling Zhang , Yayu Zhou , Tongcheng Fang , Zhihang Yuan , Guohao Dai , Wanli Ouyang , Yu Wang This is my paper

Pith reviewed 2026-05-18 22:33 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords small language modelsdynamic vocabulary selectionmemory optimizationvocabulary pruningedge deploymenthybrid selectionlanguage model inference

0 comments p. Extension

The pith

VocabTailor dynamically selects vocabulary for small language models to cut memory use by up to 99 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Small language models face memory issues from large vocabularies in embeddings and language modeling heads. VocabTailor uses a decoupled framework that offloads embeddings and applies hybrid static-dynamic selection for the LM head. This approach leverages the lexical locality principle, where only a small subset of tokens is needed per inference. If successful, it allows SLMs to run on edge devices with far less memory while maintaining performance on downstream tasks. The method outperforms static pruning by avoiding rigid designs that lose information.

Core claim

The paper introduces VocabTailor, which achieves up to 99% reduction in memory usage of vocabulary-related components with minimal or no degradation in task performance by implementing on-demand loading through hybrid static-dynamic vocabulary selection.

What carries the argument

VocabTailor, a decoupled dynamic vocabulary selection framework that offloads embeddings and uses hybrid static-dynamic selection for the LM head.

If this is right

Memory usage of vocabulary components drops by as much as 99 percent.
Performance on downstream tasks stays nearly the same or unchanged.
The approach works better than fixed pruning methods that remove tokens permanently.
It supports on-demand loading for efficient use in resource-limited settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same idea of dynamic loading could help with other heavy parts of models such as attention layers.
Combining this with other compression techniques might allow capable models on even smaller hardware.
If the small-subset pattern differs across languages, adjustments may be needed for non-English tasks.

Load-bearing premise

Only a small subset of tokens is needed during any single inference without losing important information in the prefill stage.

What would settle it

An experiment on a task requiring many diverse tokens from the start that shows large performance drops would show the central claim is wrong.

Figures

Figures reproduced from arXiv: 2508.15229 by Guohao Dai, Hanling Zhang, Tongcheng Fang, Wanli Ouyang, Yayu Zhou, Yu Wang, Zhihang Yuan.

**Figure 1.** Figure 1: Overview of VocabTailor a 128K-token vocabulary, the embedding and LM head account for over 20% of the total memory usage. As SLMs are scaled down and deployed under tighter memory constraints, vocabulary-related memory inefficiencies become increasingly unsustainable, posing a fundamental barrier to efficient SLM deployment. To address this, prior work has explored static vocabulary pruning strategies (… view at source ↗

**Figure 2.** Figure 2: Left: Input-output lexical overlap ratio. Right: Example of input-output lexical overlap in summarization task. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

read the original abstract

Small Language Models (SLMs) provide computational advantages in resource-constrained environments, yet memory limitations remain a critical bottleneck for edge device deployment. A substantial portion of SLMs' memory footprint stems from vocabulary-related components, particularly embeddings and language modeling (LM) heads, due to large vocabulary sizes. Existing static vocabulary pruning, while reducing memory usage, suffers from rigid, one-size-fits-all designs that cause information loss during the prefill stage and lack flexibility. In this work, we identify two key principles underlying the vocabulary reduction challenge: the lexical locality principle, the observation that only a small subset of tokens is required during any single inference, and the asymmetry in computational characteristics between vocabulary-related components of SLM. Based on these insights, we introduce VocabTailor, a novel decoupled dynamic vocabulary selection framework that addresses memory constraints through offloading embedding and implements a hybrid static-dynamic vocabulary selection strategy for LM Head, enabling on-demand loading of vocabulary components. Comprehensive experiments across diverse downstream tasks demonstrate that VocabTailor achieves a reduction of up to 99% in the memory usage of vocabulary-related components with minimal or no degradation in task performance, substantially outperforming existing static vocabulary pruning. Our code is available at https://github.com/AwakenedInsects/VocabTailor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VocabTailor, a decoupled dynamic vocabulary selection framework for small language models. It is motivated by the lexical locality principle (only a small token subset is needed per inference) and computational asymmetry between embeddings and LM heads. The method offloads embeddings and applies a hybrid static-dynamic selection strategy to the LM head, enabling on-demand loading. Experiments across downstream tasks report up to 99% reduction in vocabulary-related memory with minimal or no performance degradation, outperforming static pruning baselines.

Significance. If the lexical locality assumption and hybrid strategy are robust, the work provides a practical engineering advance for deploying SLMs on memory-constrained edge devices. The open-sourced code at the provided GitHub link is a clear strength for reproducibility and further testing.

major comments (2)

[Experiments] The 99% memory-reduction claim and 'minimal degradation' results rest on the lexical locality principle plus the hybrid static-dynamic LM-head strategy. The manuscript does not report token-coverage statistics or ablations on prompt length, leaving open whether the dynamic slice suffices for longer or heterogeneous inputs where prefill may require tokens outside the static subset.
[Method] §3 (Method): The hybrid selection mechanism for the LM head is load-bearing for both the memory savings and correctness during prefill. The description of how the dynamic vocabulary slice is chosen on-demand and how logits are computed without information loss requires additional detail and pseudocode to allow verification that no tokens needed for the full sequence are dropped.

minor comments (2)

[Abstract] The abstract states 'comprehensive experiments across diverse downstream tasks' but does not list the specific tasks, datasets, or metrics (e.g., accuracy, perplexity). Adding these would strengthen the presentation.
[Figures/Tables] Figure captions and table headers should explicitly state the vocabulary size before and after reduction to make the 99% figure immediately interpretable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and constructive feedback. We address the major comments point by point below and outline the revisions we will make to the manuscript.

read point-by-point responses

Referee: [Experiments] The 99% memory-reduction claim and 'minimal degradation' results rest on the lexical locality principle plus the hybrid static-dynamic LM-head strategy. The manuscript does not report token-coverage statistics or ablations on prompt length, leaving open whether the dynamic slice suffices for longer or heterogeneous inputs where prefill may require tokens outside the static subset.

Authors: We acknowledge the importance of validating the lexical locality principle more rigorously. In the revised manuscript, we will add token-coverage statistics for various prompt lengths and input heterogeneities. We will also include ablation studies on prompt length to demonstrate that the dynamic vocabulary slice maintains sufficient coverage and performance even for longer sequences. These additions will strengthen the empirical support for our claims. revision: yes
Referee: [Method] §3 (Method): The hybrid selection mechanism for the LM head is load-bearing for both the memory savings and correctness during prefill. The description of how the dynamic vocabulary slice is chosen on-demand and how logits are computed without information loss requires additional detail and pseudocode to allow verification that no tokens needed for the full sequence are dropped.

Authors: We agree that the current description in §3 could benefit from greater clarity. We will revise this section to provide a more detailed explanation of the hybrid static-dynamic selection process, including how the dynamic slice is selected on-demand during inference. Furthermore, we will include pseudocode that details the selection mechanism and the computation of logits to ensure no tokens are inadvertently dropped, thereby preserving information for the full sequence. revision: yes

Circularity Check

0 steps flagged

No circularity: engineering framework derived from stated observations and validated experimentally

full rationale

The paper's core contribution is a practical decoupled framework (VocabTailor) built on two explicitly stated empirical observations—the lexical locality principle and computational asymmetry between embedding and LM-head components—rather than any closed derivation or fitted parameter. No equations appear in the provided abstract or description that would reduce a claimed prediction to its own inputs by construction, nor are there load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation. The 99% memory reduction is presented as an experimental outcome across downstream tasks, not a mathematical identity. This is the common case of an honest engineering paper whose claims remain externally falsifiable via the reported benchmarks and released code.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions identified in the abstract and on the availability of suitable downstream task data for validation.

axioms (2)

domain assumption Lexical locality principle: only a small subset of tokens is required during any single inference
Stated as one of the two key principles underlying the vocabulary reduction challenge.
domain assumption Asymmetry in computational characteristics between vocabulary-related components of SLM
Stated as the second key principle enabling the decoupled offloading and hybrid selection strategy.

pith-pipeline@v0.9.0 · 5782 in / 1230 out tokens · 45260 ms · 2026-05-18T22:33:42.698616+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Lexical locality principle: only a small subset of tokens is required during any single inference

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 11 internal anchors

[1]

GPT-4 Technical Report

Gpt-4 technical report. arXiv preprint arXiv:2303.08774. Anthropic

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Qwen Technical Report

Qwen technical report. arXiv preprint arXiv:2309.16609. Banerjee, S.; and Lavie, A

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Evaluating Large Language Models Trained on Code

Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. Cobbe, K.; Kosaraju, V .; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Deutsch, D.; Briakou, E.; Caswell, I.; Finkelstein, M.; Ga- lor, R.; Juraska, J.; Kovacs, G.; Lui, A.; Rei, R.; Riesa, J.; et al

work page internal anchor Pith review Pith/arXiv arXiv
[5]

arXiv preprint arXiv:2502.12404

Wmt24++: Expanding the language cover- age of wmt24 to 55 languages & dialects. arXiv preprint arXiv:2502.12404. Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K

work page arXiv
[6]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Burstein, J.; Doran, C.; and Solorio, T., eds., Proceedings of the 2019 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis, Min- nesota: Associa...

work page 2019
[7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Guo, D.; Zhu, Q.; Yang, D.; Xie, Z.; Dong, K.; Zhang, W.; Chen, G.; Bi, X.; Li, Y .; et al

work page internal anchor Pith review Pith/arXiv arXiv
[8]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196. Koncel-Kedziorski, R.; Roy, S.; Amini, A.; Kushman, N.; and Hajishirzi, H

work page internal anchor Pith review Pith/arXiv arXiv
[9]

In Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics: human language technologies, 1152–1157

MAWPS: A math word problem repository. In Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics: human language technologies, 1152–1157. Kudo, T.; and Richardson, J

work page 2016
[10]

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226. Lamaakal, I.; Maleh, Y .; El Makkaoui, K.; Ouahbi, I.; Pławiak, P.; Alfarraj, O.; Almousa, M.; and Abd El-Latif, A. A

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Rho-1: Not all tokens are what you need

Rho-1: Not all tokens are what you need.arXiv preprint arXiv:2404.07965. Narayan, S.; Cohen, S. B.; and Lapata, M

work page arXiv
[12]

Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization

Don’t Give Me the Details, Just the Summary! Topic-Aware Con- volutional Neural Networks for Extreme Summarization. arXiv:1808.08745. Post, M

work page internal anchor Pith review Pith/arXiv arXiv
[13]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv:1606.05250. Rei, R.; Stewart, C.; Farinha, A. C.; and Lavie, A

work page internal anchor Pith review Pith/arXiv arXiv
[14]

In Pro- ceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2685–2702

COMET: A Neural Framework for MT Evaluation. In Pro- ceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2685–2702. Sennrich, R.; Haddow, B.; and Birch, A

work page 2020
[15]

Team, G.; Georgiev, P.; Lei, V

Are Small Language Models Ready to Compete with Large Lan- guage Models for Practical Applications? arXiv preprint arXiv:2406.11402. Team, G.; Georgiev, P.; Lei, V . I.; Burnell, R.; Bai, L.; Gulati, A.; Tanzer, G.; Vincent, D.; Pan, Z.; Wang, S.; et al

work page arXiv
[16]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini 1.5: Unlocking multimodal understand- ing across millions of tokens of context. arXiv preprint arXiv:2403.05530. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozi `ere, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. Touvro...

work page internal anchor Pith review Pith/arXiv arXiv
[17]

In Bouamor, H.; Pino, J.; and Bali, K., eds., Findings of the Association for Computational Lin- guistics: EMNLP 2023, 14725–14739

Ef- ficient Multilingual Language Model Compression through V ocabulary Trimming. In Bouamor, H.; Pino, J.; and Bali, K., eds., Findings of the Association for Computational Lin- guistics: EMNLP 2023, 14725–14739. Singapore: Associa- tion for Computational Linguistics. Van Nguyen, C.; Shen, X.; Aponte, R.; Xia, Y .; Basu, S.; Hu, Z.; Chen, J.; Parmar, M.;...

work page 2023
[18]

arXiv preprint arXiv:2410.20011

A survey of small language models. arXiv preprint arXiv:2410.20011. Wang, Z.; Liu, S.; Sun, Y .; Li, H.; and Shen, K

work page arXiv
[19]

CodeContests+: High-quality test case generation for competitive programming.arXiv preprint arXiv:2506.05817, 2025

Code- Contests+: High-Quality Test Case Generation for Compet- itive Programming. arXiv preprint arXiv:2506.05817. Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al

work page arXiv
[20]

Qwen3 Technical Report

Qwen3 technical report. arXiv preprint arXiv:2505.09388. Yang, Z.; Cui, Y .; and Chen, Z

work page internal anchor Pith review Pith/arXiv arXiv
[21]

arXiv:2502.01637

Scaling Embedding Layers in Language Models. arXiv:2502.01637. Zhang, B.; Williams, P.; Titov, I.; and Sennrich, R

work page arXiv
[22]

arXiv preprint arXiv:2004.11867

Im- proving massively multilingual neural machine translation and zero-shot translation. arXiv preprint arXiv:2004.11867

work page arXiv 2004

[1] [1]

GPT-4 Technical Report

Gpt-4 technical report. arXiv preprint arXiv:2303.08774. Anthropic

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Qwen Technical Report

Qwen technical report. arXiv preprint arXiv:2309.16609. Banerjee, S.; and Lavie, A

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Evaluating Large Language Models Trained on Code

Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. Cobbe, K.; Kosaraju, V .; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Deutsch, D.; Briakou, E.; Caswell, I.; Finkelstein, M.; Ga- lor, R.; Juraska, J.; Kovacs, G.; Lui, A.; Rei, R.; Riesa, J.; et al

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

arXiv preprint arXiv:2502.12404

Wmt24++: Expanding the language cover- age of wmt24 to 55 languages & dialects. arXiv preprint arXiv:2502.12404. Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K

work page arXiv

[6] [6]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Burstein, J.; Doran, C.; and Solorio, T., eds., Proceedings of the 2019 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis, Min- nesota: Associa...

work page 2019

[7] [7]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Guo, D.; Zhu, Q.; Yang, D.; Xie, Z.; Dong, K.; Zhang, W.; Chen, G.; Bi, X.; Li, Y .; et al

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196. Koncel-Kedziorski, R.; Roy, S.; Amini, A.; Kushman, N.; and Hajishirzi, H

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

In Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics: human language technologies, 1152–1157

MAWPS: A math word problem repository. In Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics: human language technologies, 1152–1157. Kudo, T.; and Richardson, J

work page 2016

[10] [10]

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226. Lamaakal, I.; Maleh, Y .; El Makkaoui, K.; Ouahbi, I.; Pławiak, P.; Alfarraj, O.; Almousa, M.; and Abd El-Latif, A. A

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Rho-1: Not all tokens are what you need

Rho-1: Not all tokens are what you need.arXiv preprint arXiv:2404.07965. Narayan, S.; Cohen, S. B.; and Lapata, M

work page arXiv

[12] [12]

Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization

Don’t Give Me the Details, Just the Summary! Topic-Aware Con- volutional Neural Networks for Extreme Summarization. arXiv:1808.08745. Post, M

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv:1606.05250. Rei, R.; Stewart, C.; Farinha, A. C.; and Lavie, A

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

In Pro- ceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2685–2702

COMET: A Neural Framework for MT Evaluation. In Pro- ceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2685–2702. Sennrich, R.; Haddow, B.; and Birch, A

work page 2020

[15] [15]

Team, G.; Georgiev, P.; Lei, V

Are Small Language Models Ready to Compete with Large Lan- guage Models for Practical Applications? arXiv preprint arXiv:2406.11402. Team, G.; Georgiev, P.; Lei, V . I.; Burnell, R.; Bai, L.; Gulati, A.; Tanzer, G.; Vincent, D.; Pan, Z.; Wang, S.; et al

work page arXiv

[16] [16]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini 1.5: Unlocking multimodal understand- ing across millions of tokens of context. arXiv preprint arXiv:2403.05530. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozi `ere, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. Touvro...

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

In Bouamor, H.; Pino, J.; and Bali, K., eds., Findings of the Association for Computational Lin- guistics: EMNLP 2023, 14725–14739

Ef- ficient Multilingual Language Model Compression through V ocabulary Trimming. In Bouamor, H.; Pino, J.; and Bali, K., eds., Findings of the Association for Computational Lin- guistics: EMNLP 2023, 14725–14739. Singapore: Associa- tion for Computational Linguistics. Van Nguyen, C.; Shen, X.; Aponte, R.; Xia, Y .; Basu, S.; Hu, Z.; Chen, J.; Parmar, M.;...

work page 2023

[18] [18]

arXiv preprint arXiv:2410.20011

A survey of small language models. arXiv preprint arXiv:2410.20011. Wang, Z.; Liu, S.; Sun, Y .; Li, H.; and Shen, K

work page arXiv

[19] [19]

CodeContests+: High-quality test case generation for competitive programming.arXiv preprint arXiv:2506.05817, 2025

Code- Contests+: High-Quality Test Case Generation for Compet- itive Programming. arXiv preprint arXiv:2506.05817. Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al

work page arXiv

[20] [20]

Qwen3 Technical Report

Qwen3 technical report. arXiv preprint arXiv:2505.09388. Yang, Z.; Cui, Y .; and Chen, Z

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

arXiv:2502.01637

Scaling Embedding Layers in Language Models. arXiv:2502.01637. Zhang, B.; Williams, P.; Titov, I.; and Sennrich, R

work page arXiv

[22] [22]

arXiv preprint arXiv:2004.11867

Im- proving massively multilingual neural machine translation and zero-shot translation. arXiv preprint arXiv:2004.11867

work page arXiv 2004