VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use
Pith reviewed 2026-05-22 09:25 UTC · model grok-4.3
The pith
Rebalancing the supervised fine-tuning mix toward tool-use examples lets a 42-million-parameter Spanish cybersecurity model reach 0.23 tool-selection accuracy while retaining conversational performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the zero floor previously observed on the tool-selection benchmark B4 is a corpus-density artifact rather than a capacity limit. After adjusting the SFT mixture to a 1:21 tool-use ratio, the 42M-parameter VectraYX-Nano v7 reaches B4 = 0.230 +/- 0.052, while holding B1 at 0.332 +/- 0.005 and B5 at 0.725 +/- 0.130; a LoRA adaptation of a 260M from-scratch model reaches 0.445 +/- 0.201 on the same benchmark. Curriculum replay across three phases produces monotonic loss reduction, and bootstrap-corpus ablations reveal that lower-perplexity general Spanish data harms conversational gate performance on B5.
What carries the argument
Rebalancing the SFT mixture to a 1:21 tool-use ratio, which supplies enough examples for the model to learn native tool invocation through the Model Context Protocol while preserving prior conversational and domain skills.
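A rough sketch of what such a rebalancing step could look like in a data pipeline follows. It reads 1:21 as one tool-use example for every 21 other examples, which is an assumption, as are the function name `rebalance_sft`, the record layout, and the toy pools; the paper's actual mixing procedure is not described on this page.

```python
import random

def rebalance_sft(tool_examples, other_examples, tool_parts=1, other_parts=21, seed=0):
    """Return a shuffled mixture with tool:other close to tool_parts:other_parts.

    Keeps every non-tool example and samples tool-use examples to hit the
    target ratio (hypothetical helper, not the paper's code).
    """
    rng = random.Random(seed)
    target_tool = max(1, round(len(other_examples) * tool_parts / other_parts))
    if target_tool <= len(tool_examples):
        # Enough tool-use examples: subsample without replacement.
        tool_sample = rng.sample(tool_examples, target_tool)
    else:
        # Pool is smaller than the target: upsample with replacement.
        tool_sample = [rng.choice(tool_examples) for _ in range(target_tool)]
    mixture = tool_sample + list(other_examples)
    rng.shuffle(mixture)
    return mixture

# Hypothetical toy pools, not the paper's data.
tool_pool = [{"kind": "tool", "id": i} for i in range(50)]
other_pool = [{"kind": "chat", "id": i} for i in range(2100)]
mix = rebalance_sft(tool_pool, other_pool)
n_tool = sum(1 for ex in mix if ex["kind"] == "tool")
print(n_tool, len(mix) - n_tool)  # 100 tool-use examples against 2100 others
```

Upsampling with replacement repeats tool-use examples verbatim, so a real pipeline would probably also want to track duplication; that choice is left open here.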
If this is right
- Curriculum phases with replay produce steady loss improvement across conversational, cybersecurity, and tooling data.
- A 42M-parameter model trained with the described architecture and data mix can reach sub-second time to first token (TTFT) on commodity hardware under llama.cpp.
- Spanish-native tool use becomes feasible at this scale without English pretraining or massive parameter counts.
- The same rebalancing approach can be applied to other technical domains where tool invocation is required.
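The curriculum-with-replay idea in the first bullet can be sketched as a sampler that, during each phase, draws a fixed fraction of documents from earlier phases. This is a minimal illustration under stated assumptions: `phase_stream`, the placeholder documents, and the 25% replay fraction (one value from the paper's reported sweep) are chosen for the example and are not the authors' implementation.

```python
import random

def phase_stream(current, previous, replay_frac, n_samples, seed=0):
    """Yield n_samples documents for one curriculum phase.

    With probability replay_frac a document is drawn from earlier phases
    (replay); otherwise it comes from the current phase.
    """
    rng = random.Random(seed)
    for _ in range(n_samples):
        if previous and rng.random() < replay_frac:
            yield rng.choice(previous)
        else:
            yield rng.choice(current)

# Hypothetical placeholder documents standing in for the three phases.
phases = {
    "conversational": ["conv_%d" % i for i in range(100)],
    "cybersecurity": ["sec_%d" % i for i in range(100)],
    "tooling": ["tool_%d" % i for i in range(100)],
}

seen = []          # documents from phases already completed
replay_rates = {}  # observed share of replayed documents per phase
for name, docs in phases.items():
    batch = list(phase_stream(docs, seen, replay_frac=0.25, n_samples=1000))
    replay_rates[name] = sum(1 for d in batch if d not in docs) / len(batch)
    seen = seen + docs

print(replay_rates)
```

The first phase has nothing to replay, so its rate is zero; later phases hover around the requested fraction. Sampling per document rather than per batch keeps the sketch short, and a batch-level schedule would serve the same purpose.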
Where Pith is reading between the lines
- If the density explanation holds, similar rebalancing could unlock tool use in other small models for low-resource languages.
- Native MCP integration at 42M parameters opens the possibility of lightweight agentic workflows for regional cybersecurity teams.
- The observed loss-versus-register inversion suggests that domain-specific register matters more than raw perplexity when selecting bootstrap corpora.
Load-bearing premise
The custom benchmarks B4 and B5 are treated as reliable proxies for actual cybersecurity utility in the real world.
What would settle it
Running the released model on a set of previously unseen real-world cybersecurity tool-selection and conversational tasks and finding that its accuracy remains near zero despite the reported B4 and B5 scores.
Original abstract
We present VectraYX-Nano, a 41.95M-parameter decoder-only language model trained from scratch in Spanish for cybersecurity, with a Latin-American regional focus and native tool invocation via the Model Context Protocol (MCP). The model has four contributions. (i) Corpus: VectraYX-Sec-ES, a 170M-token Spanish corpus assembled by an eight-VM distributed pipeline at ~$25 USD of cloud compute and split into three curriculum phases (conversational 42M, cybersecurity 118M, offensive tooling 10M). (ii) Architecture: a 42M Transformer decoder with GQA, QK-Norm, RMSNorm, SwiGLU, RoPE and z-loss, paired with a domain-balanced 16,384-token byte-fallback BPE. (iii) Curriculum with replay across the three phases yields a monotonic loss descent (9.80 -> 3.17 -> 3.00 -> 2.16); after SFT (loss 1.74) the v2 bootstrap-ablation reference attains a conversational gate of 0.775 +/- 0.043 on B5 over N=4 seeds, and a controlled Phase-2 replay sweep over {0,5,10,25,50}% saturates B5 at >=25% replay. (iv) Two empirical findings, both N=4. A controlled bootstrap-corpus ablation across v2 (OpenSubs), v4 (mC4-ES), and v6 (60/25/15 OpenSubs/mC4/Wiki) exposes a loss-versus-register inversion: lower-perplexity bootstraps yield measurably worse conversational behavior (v2 > v4 > v6 on B5 at every paired seed). The B4 (tool-selection) floor of 0.000 is a corpus-density artifact, not a capacity gate: rebalancing the SFT mixture to tool-use ratio 1:21 yields VectraYX-Nano v7, the released headline configuration, reaching B4 = 0.230 +/- 0.052 at 42M while retaining B1 = 0.332 +/- 0.005 and B5 = 0.725 +/- 0.130; a LoRA replication on a 260M from-scratch mid-tier reaches 0.445 +/- 0.201. The released GGUF is 96 MB in F16, runs sub-second TTFT on commodity hardware under llama.cpp, and is, to our knowledge, the first published Spanish-native cybersecurity LLM with end-to-end MCP integration.
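A few of the abstract's numbers can be cross-checked with simple arithmetic. The sketch below assumes the reported losses are per-token cross-entropy in nats (the abstract does not state units) and that 9.80 is the loss at initialization; under those assumptions it compares the starting loss with the uniform-prediction value ln(16,384) and confirms the descent is monotonic.

```python
import math

# Phase token counts from the abstract, in millions of tokens.
phase_tokens_m = {"conversational": 42, "cybersecurity": 118, "offensive tooling": 10}
total_m = sum(phase_tokens_m.values())      # 170, matching the stated corpus size
tokens_per_param = total_m / 41.95          # about 4.05 corpus tokens per parameter

# A model predicting uniformly over the vocabulary has loss ln(vocab_size).
vocab_size = 16_384
uniform_loss = math.log(vocab_size)         # about 9.704 nats

# Reported curriculum losses: start, then after each of the three phases.
losses = [9.80, 3.17, 3.00, 2.16]
monotonic = all(a > b for a, b in zip(losses, losses[1:]))
perplexities = [math.exp(x) for x in losses]  # valid only if losses are in nats

print(total_m, round(tokens_per_param, 2), round(uniform_loss, 3), monotonic)
```

The starting loss of 9.80 sits close to the uniform baseline of about 9.70, which is what one would expect from an untrained model with this vocabulary size.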
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents VectraYX-Nano, a 41.95M-parameter decoder-only Transformer trained from scratch on the 170M-token VectraYX-Sec-ES Spanish cybersecurity corpus assembled at low cost. It uses a three-phase curriculum (conversational 42M, cybersecurity 118M, offensive tooling 10M) with replay, GQA/RoPE/SwiGLU architecture, and SFT with rebalanced tool-use ratios to enable native MCP tool invocation. Empirical results from N=4 seed runs show monotonic loss descent, a loss-versus-register inversion across bootstrap corpora, saturation of B5 at >=25% replay, and that a 1:21 tool-use SFT ratio lifts B4 from 0.000 to 0.230 +/- 0.052 while retaining B5 = 0.725 +/- 0.130; a 260M LoRA replication is also reported.
Significance. If the results hold, the work shows that small from-scratch models with curriculum replay and targeted SFT rebalancing can deliver functional performance on domain-specific tasks in low-resource languages at minimal cost (~$25 corpus compute). Credit is due for the controlled N=4 seed ablations with standard deviations, explicit replay-percentage sweeps, corpus-mix experiments, and reproducible low-compute pipeline details, which provide a concrete template for similar specialized LLM efforts.
major comments (2)
- [Abstract and Results] The central empirical claim (that rebalancing the SFT mixture to a 1:21 tool-use ratio raises B4 from a 0.000 floor to 0.230 +/- 0.052 at 42M parameters, treating the floor as a pure corpus-density artifact) rests on B4 (tool-selection) and B5 (conversational gate) being valid proxies for cybersecurity utility. No correlation to external tasks, established cybersecurity benchmarks, or downstream outcomes (e.g., API invocation accuracy or exploit generation) is reported. This assumption is load-bearing for interpreting the v7 configuration and the claim that capacity limits are not at play.
- [Results] LoRA replication paragraph: The 260M LoRA model reports B4 = 0.445 +/- 0.201 (N=4), exhibiting substantially larger variance than the main model's +/- 0.052. This high uncertainty weakens any inference about scaling benefits or generalization and requires explicit discussion or additional controls to support the comparison to the 42M headline result.
minor comments (2)
- [Abstract] B1 is reported as 0.332 +/- 0.005 without definition; add a brief description of what B1 measures when it is first introduced in the main text.
- [Corpus] The offensive tooling phase (10M tokens) is described only at a high level and lacks specifics on data sources, collection method, and quality filters used in the eight-VM pipeline.
Simulated Author's Rebuttal
We appreciate the referee's recognition of our controlled N=4 seed ablations, replay sweeps, and reproducible low-compute pipeline. We address the major comments point by point below.
Referee: [Abstract and Results] The central empirical claim (that rebalancing the SFT mixture to a 1:21 tool-use ratio raises B4 from a 0.000 floor to 0.230 +/- 0.052 at 42M parameters, treating the floor as a pure corpus-density artifact) rests on B4 (tool-selection) and B5 (conversational gate) being valid proxies for cybersecurity utility. No correlation to external tasks, established cybersecurity benchmarks, or downstream outcomes (e.g., API invocation accuracy or exploit generation) is reported. This assumption is load-bearing for interpreting the v7 configuration and the claim that capacity limits are not at play.
Authors: B4 and B5 are internal evaluation metrics defined to measure the precise capabilities targeted by the work: B4 is the accuracy of correct MCP tool selection and formatting on held-out cybersecurity queries, while B5 measures whether the model appropriately gates tool use versus pure conversational responses. In the low-resource Spanish cybersecurity domain, no established external benchmarks exist that incorporate native MCP-style tool invocation. The rebalancing experiment isolates the effect of tool-use density on lifting B4 from its observed floor, supporting the corpus-density interpretation at this scale. We will add an explicit limitations paragraph acknowledging that these remain proxy metrics without direct correlation to downstream outcomes such as real API success rates or exploit generation.
Revision: partial
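To make the two metric definitions in this response concrete, here is a hedged sketch of how B4-like and B5-like scores could be computed. Everything specific is an assumption for illustration: the simplified JSON call shape with `name` and `arguments` keys (not the full MCP wire format), the tool names, the sample outputs, and the scoring rules themselves, which the paper may define differently.

```python
import json

def parse_tool_call(output):
    """Return the tool name if output is a well-formed call, else None."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return None
    if (isinstance(obj, dict) and isinstance(obj.get("name"), str)
            and isinstance(obj.get("arguments"), dict)):
        return obj["name"]
    return None

def tool_selection_accuracy(outputs, expected_tools):
    """B4-like: share of queries answered with a well-formed call to the expected tool."""
    hits = sum(parse_tool_call(o) == t for o, t in zip(outputs, expected_tools))
    return hits / len(expected_tools)

def gate_accuracy(outputs, should_call):
    """B5-like: share of prompts where call versus no-call matches the label."""
    hits = sum((parse_tool_call(o) is not None) == s for o, s in zip(outputs, should_call))
    return hits / len(should_call)

# Hypothetical model outputs and labels.
call_nmap = '{"name": "nmap_scan", "arguments": {"target": "10.0.0.5"}}'
call_whois = '{"name": "whois_lookup", "arguments": {"domain": "example.com"}}'
malformed = '{"name": "nmap_scan", "arguments": '
chat = "Hola, ¿en qué puedo ayudarte?"

b4 = tool_selection_accuracy(
    [call_nmap, call_whois, malformed, chat],
    ["nmap_scan", "cve_lookup", "nmap_scan", "whois_lookup"],
)
b5 = gate_accuracy([call_nmap, chat, chat, call_whois], [True, False, True, False])
print(b4, b5)  # 0.25 0.5
```

Under this scoring, a malformed call counts as a miss for B4 and as "no call" for B5, which is one way formatting errors could feed into both numbers.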
Referee: [Results] LoRA replication paragraph: The 260M LoRA model reports B4 = 0.445 +/- 0.201 (N=4), exhibiting substantially larger variance than the main model's +/- 0.052. This high uncertainty weakens any inference about scaling benefits or generalization and requires explicit discussion or additional controls to support the comparison to the 42M headline result.
Authors: We agree that the reported standard deviation of +/- 0.201 on the 260M LoRA B4 result is substantially larger than the main model's and reduces the strength of any scaling inference. This elevated variance is likely attributable to the smaller effective sample size for tool-use examples under LoRA adaptation. In revision we will qualify the LoRA paragraph to present the result strictly as a replication check, explicitly highlight the higher uncertainty, and refrain from drawing firm generalization conclusions from the comparison.
Revision: yes
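The weight of this concern can be gauged from the summary statistics alone. The sketch below assumes the +/- values are sample standard deviations over N=4 seeds and computes standard errors plus Welch's t statistic for the 260M LoRA versus 42M comparison. It only quantifies seed noise and ignores every other difference between the two setups.

```python
import math

def standard_error(std, n):
    """Standard error of the mean from a sample standard deviation."""
    return std / math.sqrt(n)

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = std_a ** 2 / n_a, std_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df

se_lora = standard_error(0.201, 4)  # 0.1005
se_nano = standard_error(0.052, 4)  # 0.026
t, df = welch_t(0.445, 0.201, 4, 0.230, 0.052, 4)
print(round(se_lora, 4), round(se_nano, 4), round(t, 2), round(df, 1))
```

The statistic comes out at roughly 2.07 with about 3.4 degrees of freedom. That is below the two-sided 5% critical values of 3.182 (3 d.f.) and 2.776 (4 d.f.), which is consistent with treating the LoRA result as a replication check rather than evidence of a scaling benefit.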
Circularity Check
No significant circularity; empirical results are self-contained
The paper reports direct empirical outcomes from training a 42M-parameter decoder-only model on a custom Spanish cybersecurity corpus using curriculum phases, replay, and SFT mixture rebalancing. Benchmark scores such as B4 = 0.230 +/- 0.052 and B5 = 0.725 +/- 0.130 are measured post-training under controlled ablations (e.g., replay percentages and corpus variants), not quantities that reduce by construction to the input mixture ratios or replay fractions via any equation or self-definition. No mathematical derivations, uniqueness theorems, or ansatzes are invoked; loss curves and benchmark lifts are presented as observed experimental results with explicit N=4 seed statistics. The central claims rest on these independent measurements rather than any load-bearing self-citation chain or fitted-input prediction, rendering the reported findings self-contained.
Axiom & Free-Parameter Ledger
free parameters (2)
- replay percentage
- SFT tool-use ratio
axioms (1)
- domain assumption: The chosen transformer components (GQA, QK-Norm, RMSNorm, SwiGLU, RoPE, z-loss) are appropriate for a 42M decoder on this domain.
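For readers unfamiliar with two of the listed components, a minimal pure-Python sketch of RMSNorm and of cross-entropy with an auxiliary z-loss follows. These are the textbook formulations as generally published, not the paper's code; the `eps` and `z_coef` defaults are illustrative assumptions.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then scale elementwise by gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

def cross_entropy_with_z_loss(logits, target, z_coef=1e-4):
    """Softmax cross-entropy plus z_coef * (log Z)^2, where Z = sum(exp(logits))."""
    m = max(logits)
    # Numerically stable log-sum-exp.
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    ce = log_z - logits[target]
    return ce + z_coef * log_z ** 2

y = rms_norm([3.0, 4.0], [1.0, 1.0])
loss = cross_entropy_with_z_loss([0.0, 0.0, 0.0, 0.0], target=0)
print(y, loss)
```

The z-loss term penalizes the log-partition value drifting away from zero, which is usually motivated as a stabilizer for the output softmax.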