pith. machine review for the scientific record.

arxiv: 2605.13429 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: no theorem link

TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:30 UTC · model grok-4.3

classification 💻 cs.CL
keywords vocabulary adaptation · token alignment · multilingual LLMs · text compression · parameter rearrangement · token-level distillation · LLM fine-tuning

The pith

By learning bilingual token alignments from monolingual representations, TokAlign++ rearranges parameters to adapt LLM vocabularies while preserving performance and boosting compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to handle inefficient tokenization and vocabulary mismatches that slow down LLMs and block knowledge transfer. It treats source and target vocabularies as separate languages and derives a bilingual token alignment lexicon directly from their monolingual representations. Parameters are rearranged according to this lexicon and then progressively fine-tuned for the new vocabulary. Tests across 15 languages demonstrate higher text compression rates and retention of most multilingual capabilities, with performance restored in roughly 1,000 steps. Once vocabularies are unified, the same alignment supports effective token-level distillation using only 235 million tokens.
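
To make the mechanics concrete, here is a minimal sketch of the alignment step, not the paper's implementation: it assumes token embeddings for the two vocabularies have already been learned from monolingual data and projected into a shared space, and that alignment reduces to cosine nearest-neighbour search. The abstract specifies neither the embedding method nor the matching objective, so every name below is illustrative.

    # Hypothetical sketch: derive a bilingual token alignment lexicon from
    # monolingual token embeddings assumed to live in a shared vector space.
    import numpy as np

    def build_alignment_lexicon(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
        """For each target token ID, return the source token ID whose embedding
        is nearest in cosine similarity. Shapes: (V_src, d) and (V_tgt, d)."""
        src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
        tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
        sim = tgt @ src.T              # (V_tgt, V_src) cosine similarity matrix
        return sim.argmax(axis=1)      # lexicon[t] = best-matching source token ID

For typical LLM vocabularies the dense similarity matrix is large but feasible; bigger vocabulary pairs would call for blockwise or approximate nearest-neighbour search, and retrieval criteria such as CSLS are common refinements in the bilingual-lexicon-induction literature.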

Core claim

TokAlign++ advances vocabulary adaptation by deriving a bilingual token alignment lexicon from monolingual token representations, using it to rearrange model parameters for the target vocabulary, and applying progressive fine-tuning to recover and enhance performance on multilingual tasks.

What carries the argument

The bilingual token alignment lexicon derived from monolingual token representations, which supplies the mappings needed to rearrange parameters and initialize adaptation.

Load-bearing premise

A bilingual token alignment lexicon learned solely from monolingual token representations supplies mappings accurate enough for parameter rearrangement and progressive fine-tuning to succeed with only minor performance loss.
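
If the premise holds, the rearrangement itself is mechanically simple: vocabulary-dependent parameters are re-indexed through the lexicon while the transformer body is left untouched, which is why progressive fine-tuning is then needed. The sketch below assumes a Hugging Face transformers-style interface (get_input_embeddings / get_output_embeddings) and that the embedding table and LM head are the only vocabulary-dependent parameters; neither detail is confirmed by the abstract.

    # Hypothetical sketch: initialise target-vocabulary parameters by copying
    # the source rows selected by the alignment lexicon.
    import torch

    def rearrange_vocab_parameters(model, lexicon: torch.LongTensor, tgt_vocab_size: int):
        """lexicon[t] is the source token ID used to initialise target token t."""
        with torch.no_grad():
            src_in = model.get_input_embeddings().weight              # (V_src, d)
            lexicon = lexicon.to(src_in.device)
            new_in = torch.nn.Embedding(tgt_vocab_size, src_in.size(1))
            new_in = new_in.to(device=src_in.device, dtype=src_in.dtype)
            new_in.weight.copy_(src_in[lexicon])
            model.set_input_embeddings(new_in)

            head = model.get_output_embeddings()                      # LM head, possibly tied
            if head is not None:
                new_head = torch.nn.Linear(head.weight.size(1), tgt_vocab_size, bias=False)
                new_head = new_head.to(device=src_in.device, dtype=src_in.dtype)
                new_head.weight.copy_(head.weight[lexicon])
                model.set_output_embeddings(new_head)
        return model

Tied input/output embeddings would need to be re-tied after the swap, and the tokenizer must be replaced by the target-vocabulary tokenizer before any fine-tuning.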

What would settle it

If adapted models still show large drops in multilingual task accuracy or compression rates after 1,000 fine-tuning steps compared with the original models, the central claim fails.
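
Both quantities in that test are easy to operationalise. One common definition of text compression, assumed here because the abstract does not pin down the metric, is UTF-8 bytes of raw text per emitted token, measured with any tokenizer that exposes an encode method:

    # Hypothetical check: compare compression (UTF-8 bytes per token) of the
    # vanilla and adapted tokenizers on the same multilingual corpus.
    def bytes_per_token(tokenizer, texts):
        total_bytes = sum(len(t.encode("utf-8")) for t in texts)
        total_tokens = sum(len(tokenizer.encode(t)) for t in texts)
        return total_bytes / total_tokens

A higher bytes-per-token value means shorter token-ID sequences for the same text, so the adapted model should match or exceed the vanilla model on the 15 evaluation languages, alongside comparable task accuracy, after roughly 1,000 fine-tuning steps.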

read the original abstract

Tokenization is a foundational step in the text processing of Large Language Models (LLMs). Texts must first be tokenized into token IDs, which are then input to LLMs. Inefficient tokenization produces long token-ID sequences and slows down the training and inference of LLMs. Fine-grained knowledge transfer between LLMs, such as token-level distillation, is also impeded by vocabulary mismatch. To bridge this gap, we introduce a method named TokAlign++ that improves vocabulary adaptation performance by learning a better token alignment lexicon. The source and target vocabularies are treated as two different languages, and a bilingual token alignment lexicon is learned from monolingual token representations. Model parameters are rearranged following this bilingual lexicon for the new vocabulary, and progressively fine-tuned for adaptation. Experimental results on 15 languages show that our method boosts multilingual text compression rates and preserves most of the multilingual ability of vanilla models. It takes as few as 1k steps to restore the performance of the vanilla model. After unifying vocabularies between vanilla models, token-level distillation markedly improves the base model with only 235M tokens.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces TokAlign++, a vocabulary adaptation technique for LLMs. It treats source and target vocabularies as distinct languages, learns a bilingual token alignment lexicon solely from monolingual token representations, rearranges model parameters according to this lexicon, and applies progressive fine-tuning. Experiments across 15 languages are reported to improve multilingual text compression rates while largely preserving the original model's multilingual capabilities, with performance recovery in as few as 1k steps and enhanced token-level distillation using only 235M tokens after vocabulary unification.

Significance. If the reported gains are robustly supported, the method could offer a practical route to more efficient multilingual tokenization and cross-model knowledge transfer, particularly valuable for adapting models to new languages or vocabularies with minimal additional compute. The emphasis on low-step recovery and distillation efficiency addresses real deployment constraints in multilingual settings.

major comments (3)
  1. [Abstract / Experimental Results] The abstract claims improved compression rates and near-preservation of multilingual ability on 15 languages, yet provides no information on baselines (e.g., random alignment, embedding-based matching without rearrangement), statistical significance tests, or exact controls for training data volume and hyperparameters. This absence prevents verification that the gains are attributable to the proposed alignment rather than fine-tuning alone.
  2. [Method] Method description (core construction): The bilingual alignment lexicon is derived exclusively from monolingual token representations without parallel data or joint cross-lingual training. This creates a load-bearing assumption that embedding proximity yields functionally accurate mappings; for low-resource languages or differing scripts, spurious alignments could place the rearranged model far from the target loss landscape, rendering the 1k-step recovery and distillation claims dependent on untested correction speed.
  3. [Experimental Results] Distillation results: The claim that token-level distillation 'remarkably improves' the base model after vocabulary unification uses only 235M tokens, but no comparison is given to standard distillation baselines or to the performance of the original unified-vocabulary model before rearrangement. Without these controls, it is unclear whether the improvement stems from TokAlign++ or from the distillation procedure itself.
minor comments (2)
  1. [Method] Notation for the alignment lexicon and rearrangement step should be formalized with explicit equations to clarify how token IDs are mapped and parameters are copied or interpolated.
  2. The manuscript should include a limitations section discussing failure modes for languages with highly divergent morphologies or scripts, as the monolingual-alignment premise may not generalize uniformly.
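
The random-alignment control requested in major comment 1 is cheap to specify: keep the rearrangement and fine-tuning recipe fixed and replace only the learned lexicon with a uniformly random one. A hedged sketch, reusing the illustrative helpers from the pipeline sketches above:

    # Hypothetical random-alignment control: same rearrangement and fine-tuning,
    # but the lexicon is drawn uniformly at random instead of being learned.
    import numpy as np

    def random_lexicon(tgt_vocab_size: int, src_vocab_size: int, seed: int = 0) -> np.ndarray:
        rng = np.random.default_rng(seed)
        return rng.integers(0, src_vocab_size, size=tgt_vocab_size)

Any gain of the learned lexicon over this control at matched data volume and hyperparameters is attributable to the alignment rather than to fine-tuning alone.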

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our contributions. We address each major point below and commit to revisions that strengthen the empirical support for TokAlign++.

read point-by-point responses
  1. Referee: [Abstract / Experimental Results] The abstract claims improved compression rates and near-preservation of multilingual ability on 15 languages, yet provides no information on baselines (e.g., random alignment, embedding-based matching without rearrangement), statistical significance tests, or exact controls for training data volume and hyperparameters. This absence prevents verification that the gains are attributable to the proposed alignment rather than fine-tuning alone.

    Authors: We agree that explicit baselines and controls are required. In the revised manuscript we will add comparisons to (i) random token alignment and (ii) embedding-based matching without parameter rearrangement. We will also report statistical significance (paired t-tests over three random seeds) and provide exact training-data volumes, learning-rate schedules, and step counts for every experiment. These additions will isolate the contribution of the learned alignment lexicon from generic fine-tuning effects. revision: yes

  2. Referee: [Method] Method description (core construction): The bilingual alignment lexicon is derived exclusively from monolingual token representations without parallel data or joint cross-lingual training. This creates a load-bearing assumption that embedding proximity yields functionally accurate mappings; for low-resource languages or differing scripts, spurious alignments could place the rearranged model far from the target loss landscape, rendering the 1k-step recovery and distillation claims dependent on untested correction speed.

    Authors: We acknowledge that the method rests on the assumption that monolingual embedding proximity produces functionally useful mappings. Our experiments already cover 15 languages that include several low-resource and non-Latin-script cases, and we observe consistent 1k-step recovery. In the revision we will add an explicit limitations paragraph discussing the assumption, report alignment-quality diagnostics (e.g., top-1 accuracy on a small held-out parallel set where available), and note that recovery speed may vary for languages outside the current test suite. revision: partial

  3. Referee: [Experimental Results] Distillation results: The claim that token-level distillation 'remarkably improves' the base model after vocabulary unification uses only 235M tokens, but no comparison is given to standard distillation baselines or to the performance of the original unified-vocabulary model before rearrangement. Without these controls, it is unclear whether the improvement stems from TokAlign++ or from the distillation procedure itself.

    Authors: We will include the missing controls in the revised experimental section: (i) standard distillation applied directly to the unified vocabulary without TokAlign++ rearrangement and (ii) the performance of the original unified-vocabulary model prior to rearrangement. These comparisons will be reported alongside the 235M-token TokAlign++ distillation results, allowing readers to attribute gains specifically to the alignment step. revision: yes
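
What those controls would isolate depends on what token-level distillation means once vocabularies are unified: teacher and student then share token IDs, so a per-position divergence between their next-token distributions is well defined. The paper's exact loss is not given in the abstract; a common formulation, sketched here as an assumption, is the temperature-scaled KL divergence:

    # Hypothetical token-level distillation loss under a unified vocabulary:
    # per-position KL divergence between teacher and student distributions.
    import torch.nn.functional as F

    def token_level_distillation_loss(student_logits, teacher_logits, temperature=1.0):
        """Both logit tensors have shape (batch, seq_len, vocab); the shared
        vocabulary is what makes this per-token comparison possible."""
        s = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, 1)
        t = F.softmax(teacher_logits / temperature, dim=-1).flatten(0, 1)
        # KL(teacher || student), averaged over all token positions
        return F.kl_div(s, t, reduction="batchmean") * (temperature ** 2)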

Circularity Check

0 steps flagged

No circularity: alignment learning, rearrangement, and fine-tuning yield independently measured experimental outcomes

full rationale

The described chain learns a bilingual alignment lexicon from monolingual token representations, rearranges parameters according to the lexicon, and applies progressive fine-tuning. Reported results (compression rates on 15 languages, multilingual ability preservation, 1k-step recovery, and distillation gains on 235M tokens) are external benchmarks, not quantities defined by or equivalent to the alignment step itself. No equations, self-citations, or fitted-input-as-prediction reductions appear in the abstract or description that would collapse the claims to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities. The method is presented at a high level without detailing the learning objective or assumptions used to derive the alignment lexicon.

pith-pipeline@v0.9.0 · 5498 in / 1147 out tokens · 79384 ms · 2026-05-14T19:30:54.808410+00:00 · methodology

discussion (0)

