pith. machine review for the scientific record.

arxiv: 2605.13429 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: no theorem link

TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:30 UTC · model grok-4.3

classification 💻 cs.CL
keywords vocabulary adaptation · token alignment · multilingual LLMs · text compression · parameter rearrangement · token-level distillation · LLM fine-tuning

The pith

By learning bilingual token alignments from monolingual representations, TokAlign++ rearranges parameters to adapt LLM vocabularies while preserving performance and boosting compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to handle inefficient tokenization and vocabulary mismatches that slow down LLMs and block knowledge transfer. It treats source and target vocabularies as separate languages and derives a bilingual token alignment lexicon directly from their monolingual representations. Parameters are rearranged according to this lexicon and then progressively fine-tuned for the new vocabulary. Tests across 15 languages demonstrate higher text compression rates and retention of most multilingual capabilities, with performance restored in roughly 1,000 steps. Once vocabularies are unified, the same alignment supports effective token-level distillation using only 235 million tokens.
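
To make the mechanics concrete, here is a minimal sketch of the alignment step, not the paper's implementation: it assumes token embeddings for the two vocabularies have already been learned from monolingual data and projected into a shared space, and that alignment reduces to cosine nearest-neighbour search. The abstract specifies neither the embedding method nor the matching objective, so every name below is illustrative.

    # Hypothetical sketch: derive a bilingual token alignment lexicon from
    # monolingual token embeddings assumed to live in a shared vector space.
    import numpy as np

    def build_alignment_lexicon(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
        """For each target token ID, return the source token ID whose embedding
        is nearest in cosine similarity. Shapes: (V_src, d) and (V_tgt, d)."""
        src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
        tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
        sim = tgt @ src.T              # (V_tgt, V_src) cosine similarity matrix
        return sim.argmax(axis=1)      # lexicon[t] = best-matching source token ID

For typical LLM vocabularies the dense similarity matrix is large but feasible; bigger vocabulary pairs would call for blockwise or approximate nearest-neighbour search, and retrieval criteria such as CSLS are common refinements in the bilingual-lexicon-induction literature.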

Core claim

TokAlign++ advances vocabulary adaptation by deriving a bilingual token alignment lexicon from monolingual token representations, using it to rearrange model parameters for the target vocabulary, and applying progressive fine-tuning to recover and enhance performance on multilingual tasks.

What carries the argument

The bilingual token alignment lexicon derived from monolingual token representations, which supplies the mappings needed to rearrange parameters and initialize adaptation.

Load-bearing premise

A bilingual token alignment lexicon learned solely from monolingual token representations supplies mappings accurate enough for parameter rearrangement and progressive fine-tuning to succeed with only minor performance loss.
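
If the premise holds, the rearrangement itself is mechanically simple: vocabulary-dependent parameters are re-indexed through the lexicon while the transformer body is left untouched, which is why progressive fine-tuning is then needed. The sketch below assumes a Hugging Face transformers-style interface (get_input_embeddings / get_output_embeddings) and that the embedding table and LM head are the only vocabulary-dependent parameters; neither detail is confirmed by the abstract.

    # Hypothetical sketch: initialise target-vocabulary parameters by copying
    # the source rows selected by the alignment lexicon.
    import torch

    def rearrange_vocab_parameters(model, lexicon: torch.LongTensor, tgt_vocab_size: int):
        """lexicon[t] is the source token ID used to initialise target token t."""
        with torch.no_grad():
            src_in = model.get_input_embeddings().weight              # (V_src, d)
            lexicon = lexicon.to(src_in.device)
            new_in = torch.nn.Embedding(tgt_vocab_size, src_in.size(1))
            new_in = new_in.to(device=src_in.device, dtype=src_in.dtype)
            new_in.weight.copy_(src_in[lexicon])
            model.set_input_embeddings(new_in)

            head = model.get_output_embeddings()                      # LM head, possibly tied
            if head is not None:
                new_head = torch.nn.Linear(head.weight.size(1), tgt_vocab_size, bias=False)
                new_head = new_head.to(device=src_in.device, dtype=src_in.dtype)
                new_head.weight.copy_(head.weight[lexicon])
                model.set_output_embeddings(new_head)
        return model

Tied input/output embeddings would need to be re-tied after the swap, and the tokenizer must be replaced by the target-vocabulary tokenizer before any fine-tuning.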

What would settle it

If adapted models still show large drops in multilingual task accuracy or compression rates after 1,000 fine-tuning steps compared with the original models, the central claim fails.
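
Both quantities in that test are easy to operationalise. One common definition of text compression, assumed here because the abstract does not pin down the metric, is UTF-8 bytes of raw text per emitted token, measured with any tokenizer that exposes an encode method:

    # Hypothetical check: compare compression (UTF-8 bytes per token) of the
    # vanilla and adapted tokenizers on the same multilingual corpus.
    def bytes_per_token(tokenizer, texts):
        total_bytes = sum(len(t.encode("utf-8")) for t in texts)
        total_tokens = sum(len(tokenizer.encode(t)) for t in texts)
        return total_bytes / total_tokens

A higher bytes-per-token value means shorter token-ID sequences for the same text, so the adapted model should match or exceed the vanilla model on the 15 evaluation languages, alongside comparable task accuracy, after roughly 1,000 fine-tuning steps.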

read the original abstract

Tokenization is a foundational step in the text processing of Large Language Models (LLMs). Texts must first be tokenized into token IDs, which are then input to LLMs. Inefficient tokenization produces long token-ID sequences and slows down the training and inference of LLMs. Fine-grained knowledge transfer between LLMs, such as token-level distillation, is also impeded by vocabulary mismatch. To bridge this gap, we introduce a method named TokAlign++ that improves vocabulary adaptation performance by learning a better token alignment lexicon. The source and target vocabularies are treated as two different languages, and a bilingual token alignment lexicon is learned from monolingual token representations. Model parameters are rearranged following this bilingual lexicon for the new vocabulary, and progressively fine-tuned for adaptation. Experimental results on 15 languages show that our method boosts multilingual text compression rates and preserves most of the multilingual ability of vanilla models. It takes as few as 1k steps to restore the performance of the vanilla model. After unifying vocabularies between vanilla models, token-level distillation markedly improves the base model with only 235M tokens.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces TokAlign++, a vocabulary adaptation technique for LLMs. It treats source and target vocabularies as distinct languages, learns a bilingual token alignment lexicon solely from monolingual token representations, rearranges model parameters according to this lexicon, and applies progressive fine-tuning. Experiments across 15 languages are reported to improve multilingual text compression rates while largely preserving the original model's multilingual capabilities, with performance recovery in as few as 1k steps and enhanced token-level distillation using only 235M tokens after vocabulary unification.

Significance. If the reported gains are robustly supported, the method could offer a practical route to more efficient multilingual tokenization and cross-model knowledge transfer, particularly valuable for adapting models to new languages or vocabularies with minimal additional compute. The emphasis on low-step recovery and distillation efficiency addresses real deployment constraints in multilingual settings.

major comments (3)
  1. [Abstract / Experimental Results] The abstract claims improved compression rates and near-preservation of multilingual ability on 15 languages, yet provides no information on baselines (e.g., random alignment, embedding-based matching without rearrangement), statistical significance tests, or exact controls for training data volume and hyperparameters. This absence prevents verification that the gains are attributable to the proposed alignment rather than fine-tuning alone.
  2. [Method] Method description (core construction): The bilingual alignment lexicon is derived exclusively from monolingual token representations without parallel data or joint cross-lingual training. This creates a load-bearing assumption that embedding proximity yields functionally accurate mappings; for low-resource languages or differing scripts, spurious alignments could place the rearranged model far from the target loss landscape, rendering the 1k-step recovery and distillation claims dependent on untested correction speed.
  3. [Experimental Results] Distillation results: The claim that token-level distillation 'remarkably improves' the base model after vocabulary unification uses only 235M tokens, but no comparison is given to standard distillation baselines or to the performance of the original unified-vocabulary model before rearrangement. Without these controls, it is unclear whether the improvement stems from TokAlign++ or from the distillation procedure itself.
minor comments (2)
  1. [Method] Notation for the alignment lexicon and rearrangement step should be formalized with explicit equations to clarify how token IDs are mapped and parameters are copied or interpolated.
  2. The manuscript should include a limitations section discussing failure modes for languages with highly divergent morphologies or scripts, as the monolingual-alignment premise may not generalize uniformly.
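
The random-alignment control requested in major comment 1 is cheap to specify: keep the rearrangement and fine-tuning recipe fixed and replace only the learned lexicon with a uniformly random one. A hedged sketch, reusing the illustrative helpers from the pipeline sketches above:

    # Hypothetical random-alignment control: same rearrangement and fine-tuning,
    # but the lexicon is drawn uniformly at random instead of being learned.
    import numpy as np

    def random_lexicon(tgt_vocab_size: int, src_vocab_size: int, seed: int = 0) -> np.ndarray:
        rng = np.random.default_rng(seed)
        return rng.integers(0, src_vocab_size, size=tgt_vocab_size)

Any gain of the learned lexicon over this control at matched data volume and hyperparameters is attributable to the alignment rather than to fine-tuning alone.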

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our contributions. We address each major point below and commit to revisions that strengthen the empirical support for TokAlign++.

read point-by-point responses
  1. Referee: [Abstract / Experimental Results] The abstract claims improved compression rates and near-preservation of multilingual ability on 15 languages, yet provides no information on baselines (e.g., random alignment, embedding-based matching without rearrangement), statistical significance tests, or exact controls for training data volume and hyperparameters. This absence prevents verification that the gains are attributable to the proposed alignment rather than fine-tuning alone.

    Authors: We agree that explicit baselines and controls are required. In the revised manuscript we will add comparisons to (i) random token alignment and (ii) embedding-based matching without parameter rearrangement. We will also report statistical significance (paired t-tests over three random seeds) and provide exact training-data volumes, learning-rate schedules, and step counts for every experiment. These additions will isolate the contribution of the learned alignment lexicon from generic fine-tuning effects. revision: yes

  2. Referee: [Method] Method description (core construction): The bilingual alignment lexicon is derived exclusively from monolingual token representations without parallel data or joint cross-lingual training. This creates a load-bearing assumption that embedding proximity yields functionally accurate mappings; for low-resource languages or differing scripts, spurious alignments could place the rearranged model far from the target loss landscape, rendering the 1k-step recovery and distillation claims dependent on untested correction speed.

    Authors: We acknowledge that the method rests on the assumption that monolingual embedding proximity produces functionally useful mappings. Our experiments already cover 15 languages that include several low-resource and non-Latin-script cases, and we observe consistent 1k-step recovery. In the revision we will add an explicit limitations paragraph discussing the assumption, report alignment-quality diagnostics (e.g., top-1 accuracy on a small held-out parallel set where available), and note that recovery speed may vary for languages outside the current test suite. revision: partial

  3. Referee: [Experimental Results] Distillation results: The claim that token-level distillation 'remarkably improves' the base model after vocabulary unification uses only 235M tokens, but no comparison is given to standard distillation baselines or to the performance of the original unified-vocabulary model before rearrangement. Without these controls, it is unclear whether the improvement stems from TokAlign++ or from the distillation procedure itself.

    Authors: We will include the missing controls in the revised experimental section: (i) standard distillation applied directly to the unified vocabulary without TokAlign++ rearrangement and (ii) the performance of the original unified-vocabulary model prior to rearrangement. These comparisons will be reported alongside the 235M-token TokAlign++ distillation results, allowing readers to attribute gains specifically to the alignment step. revision: yes
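
What those controls would isolate depends on what token-level distillation means once vocabularies are unified: teacher and student then share token IDs, so a per-position divergence between their next-token distributions is well defined. The paper's exact loss is not given in the abstract; a common formulation, sketched here as an assumption, is the temperature-scaled KL divergence:

    # Hypothetical token-level distillation loss under a unified vocabulary:
    # per-position KL divergence between teacher and student distributions.
    import torch.nn.functional as F

    def token_level_distillation_loss(student_logits, teacher_logits, temperature=1.0):
        """Both logit tensors have shape (batch, seq_len, vocab); the shared
        vocabulary is what makes this per-token comparison possible."""
        s = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, 1)
        t = F.softmax(teacher_logits / temperature, dim=-1).flatten(0, 1)
        # KL(teacher || student), averaged over all token positions
        return F.kl_div(s, t, reduction="batchmean") * (temperature ** 2)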

Circularity Check

0 steps flagged

No circularity: alignment learning, rearrangement, and fine-tuning yield independently measured experimental outcomes

full rationale

The described chain learns a bilingual alignment lexicon from monolingual token representations, rearranges parameters according to the lexicon, and applies progressive fine-tuning. Reported results (compression rates on 15 languages, multilingual ability preservation, 1k-step recovery, and distillation gains on 235M tokens) are external benchmarks, not quantities defined by or equivalent to the alignment step itself. No equations, self-citations, or fitted-input-as-prediction reductions appear in the abstract or description that would collapse the claims to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities. The method is presented at a high level without detailing the learning objective or assumptions used to derive the alignment lexicon.

pith-pipeline@v0.9.0 · 5498 in / 1147 out tokens · 79384 ms · 2026-05-14T19:30:54.808410+00:00 · methodology

discussion (0)

