VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

Juan S. Santillana

arxiv: 2605.13989 · v3 · pith:4PVIJ5BDnew · submitted 2026-05-13 · 💻 cs.CL

Pith reviewed 2026-05-22 09:25 UTC · model grok-4.3

classification 💻 cs.CL

keywords Spanish language model · cybersecurity · curriculum learning · tool use · small language models · SFT rebalancing · decoder-only transformer

The pith

Rebalancing the supervised fine-tuning mix toward tool-use examples lets a 42-million-parameter Spanish cybersecurity model reach 0.23 tool-selection accuracy while retaining conversational performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that a compact decoder-only language model trained from scratch on a modest Spanish cybersecurity corpus can acquire native tool-invocation capability through curriculum learning and targeted data rebalancing. It assembles a 170-million-token corpus split into conversational, domain, and tooling phases, applies replay during training to maintain earlier skills, and shows that increasing the proportion of tool-use examples in the final fine-tuning stage overcomes a zero baseline on tool selection. If correct, this would mean that specialized technical models in non-English languages can be built and deployed on ordinary hardware without requiring either massive scale or English-centric pretraining.

Core claim

The central claim is that the zero floor previously observed on the tool-selection benchmark B4 is a corpus-density artifact rather than a capacity limit. After adjusting the SFT mixture to a 1:21 tool-use ratio, the 42M-parameter VectraYX-Nano v7 reaches B4 = 0.230 +/- 0.052, while holding B1 at 0.332 +/- 0.005 and B5 at 0.725 +/- 0.130; a LoRA adaptation of a 260M from-scratch model reaches 0.445 +/- 0.201 on the same benchmark. Curriculum replay across three phases produces monotonic loss reduction, and bootstrap-corpus ablations reveal that lower-perplexity general Spanish data harms conversational gate performance on B5.

What carries the argument

Rebalancing the SFT mixture to a 1:21 tool-use ratio, which supplies enough examples for the model to learn native tool invocation through the Model Context Protocol while preserving prior conversational and domain skills.
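
A minimal sketch of what rebalancing an SFT mixture to a fixed ratio could look like. This page does not say how the 1:21 ratio is defined or implemented, so the sketch makes loud assumptions: the ratio counts examples (not tokens), it means one tool-use example for every 21 other examples, all non-tool examples are kept, and tool-use examples are resampled to hit the target. The function name `rebalance_sft_mix` and the toy data are hypothetical.

```python
import random

def rebalance_sft_mix(tool_examples, other_examples,
                      tool_parts=1, other_parts=21, seed=0):
    """Return a shuffled mix with tool:other close to tool_parts:other_parts.

    Assumption (not from the paper): keep every non-tool example and
    up- or down-sample the tool-use pool to reach the target count.
    """
    if not tool_examples:
        raise ValueError("need at least one tool-use example")
    rng = random.Random(seed)
    target_tool = max(1, round(len(other_examples) * tool_parts / other_parts))
    if target_tool <= len(tool_examples):
        # Enough tool-use data: subsample without replacement.
        tool_sample = rng.sample(tool_examples, target_tool)
    else:
        # Too little tool-use data: upsample with replacement.
        tool_sample = [rng.choice(tool_examples) for _ in range(target_tool)]
    mix = list(other_examples) + tool_sample
    rng.shuffle(mix)
    return mix

# Toy pools: 2100 non-tool examples and only 20 tool-use examples.
other = [{"kind": "chat", "id": i} for i in range(2100)]
tools = [{"kind": "tool", "id": i} for i in range(20)]

mix = rebalance_sft_mix(tools, other)
n_tool = sum(1 for ex in mix if ex["kind"] == "tool")
n_other = len(mix) - n_tool
print(f"tool-use: {n_tool}, other: {n_other}")
```

Upsampling with replacement repeats tool-use examples, which is one simple way to raise density when the tool-use pool is small; whether the paper repeats, synthesizes, or reweights examples is not stated here.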

If this is right

Curriculum phases with replay produce steady loss improvement across conversational, cybersecurity, and tooling data.
A 42M-parameter model can reach sub-second time to first token (TTFT) on commodity hardware under llama.cpp once trained with the described architecture and data mix.
Spanish-native tool use becomes feasible at this scale without English pretraining or massive parameter counts.
The same rebalancing approach can be applied to other technical domains where tool invocation is required.
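
The first point in the list above, curriculum phases with replay, can be sketched as a batch sampler. This is an illustration under stated assumptions, not the paper's pipeline: "replay" is taken to mean that each batch in a later phase reserves a fixed fraction of slots for examples drawn from all earlier phases. The function name `phase_batches`, the batch size, and the toy data are hypothetical; the 25% fraction is chosen only because the abstract reports B5 saturating at >=25% replay.

```python
import random

def phase_batches(current, earlier, replay_frac, batch_size, n_batches, seed=0):
    """Yield batches for one phase, mixing in replay from earlier phases.

    Assumption: replay is a fixed per-batch fraction sampled uniformly
    from the union of all earlier-phase data.
    """
    rng = random.Random(seed)
    n_replay = round(batch_size * replay_frac) if earlier else 0
    n_current = batch_size - n_replay
    for _ in range(n_batches):
        batch = [rng.choice(current) for _ in range(n_current)]
        batch += [rng.choice(earlier) for _ in range(n_replay)]
        rng.shuffle(batch)
        yield batch

# Toy stand-ins for the three curriculum phases, in training order.
phases = {
    "conversational": [f"conv-{i}" for i in range(100)],
    "cybersecurity": [f"sec-{i}" for i in range(100)],
    "tooling": [f"tool-{i}" for i in range(100)],
}

seen = []       # data from phases already completed
schedule = []   # (phase name, list of batches)
for name, data in phases.items():
    batches = list(phase_batches(data, seen, replay_frac=0.25,
                                 batch_size=8, n_batches=3))
    schedule.append((name, batches))
    seen = seen + data

for name, batches in schedule:
    print(name, batches[0])
```

The first phase has nothing to replay, so its batches are drawn entirely from its own data; later phases keep a quarter of each batch for earlier material.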

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the density explanation holds, similar rebalancing could unlock tool use in other small models for low-resource languages.
Native MCP integration at 42M parameters opens the possibility of lightweight agentic workflows for regional cybersecurity teams.
The observed loss-versus-register inversion suggests that domain-specific register matters more than raw perplexity when selecting bootstrap corpora.

Load-bearing premise

The custom benchmarks B4 and B5 are treated as reliable proxies for actual cybersecurity utility in the real world.

What would settle it

Running the released model on a set of previously unseen real-world cybersecurity tool-selection and conversational tasks and finding that its accuracy remains near zero despite the reported B4 and B5 scores.

Figures

Figures reproduced from arXiv: 2605.13989 by Juan S. Santillana.

**Figure 2.** Validation loss monotonically decreases across the … (caption truncated in source)

**Figure 3.** B4 tool-selection accuracy vs. tool-use corpus den… (caption truncated in source)

**Figure 3.** B5 conversational gate as a function of Phase … (caption truncated in source)

**Figure 4.** B1–B5 scores across the VectraYX family under the mixed SFT baseline. Error bars on Nano show … (caption truncated in source)

**Figure 4.** B4 tool-selection accuracy vs. tool-use corpus den… (caption truncated in source)

**Figure 5.** B1–B5 scores across the VectraYX family under the mixed SFT baseline. Error bars on Nano show … (caption truncated in source)

Original abstract

We present VectraYX-Nano, a 41.95M-parameter decoder-only language model trained from scratch in Spanish for cybersecurity, with a Latin-American regional focus and native tool invocation via the Model Context Protocol (MCP). The model has four contributions. (i) Corpus: VectraYX-Sec-ES, a 170M-token Spanish corpus assembled by an eight-VM distributed pipeline at ~$25 USD of cloud compute and split into three curriculum phases (conversational 42M, cybersecurity 118M, offensive tooling 10M). (ii) Architecture: a 42M Transformer decoder with GQA, QK-Norm, RMSNorm, SwiGLU, RoPE and z-loss, paired with a domain-balanced 16,384-token byte-fallback BPE. (iii) Curriculum with replay across the three phases yields a monotonic loss descent (9.80 -> 3.17 -> 3.00 -> 2.16); after SFT (loss 1.74) the v2 bootstrap-ablation reference attains a conversational gate of 0.775 +/- 0.043 on B5 over N=4 seeds, and a controlled Phase-2 replay sweep over {0,5,10,25,50}% saturates B5 at >=25% replay. (iv) Two empirical findings, both N=4. A controlled bootstrap-corpus ablation across v2 (OpenSubs), v4 (mC4-ES), and v6 (60/25/15 OpenSubs/mC4/Wiki) exposes a loss-versus-register inversion: lower-perplexity bootstraps yield measurably worse conversational behavior (v2 > v4 > v6 on B5 at every paired seed). The B4 (tool-selection) floor of 0.000 is a corpus-density artifact, not a capacity gate: rebalancing the SFT mixture to tool-use ratio 1:21 yields VectraYX-Nano v7, the released headline configuration, reaching B4 = 0.230 +/- 0.052 at 42M while retaining B1 = 0.332 +/- 0.005 and B5 = 0.725 +/- 0.130; a LoRA replication on a 260M from-scratch mid-tier reaches 0.445 +/- 0.201. The released GGUF is 96 MB in F16, runs sub-second TTFT on commodity hardware under llama.cpp, and is, to our knowledge, the first published Spanish-native cybersecurity LLM with end-to-end MCP integration.
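
The corpus and model-size figures in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is only a consistency check on the stated numbers: it sums the three phase sizes, computes each phase's share of the corpus, and estimates the raw F16 weight footprint at 2 bytes per parameter. It makes no attempt to account for the difference between that lower bound and the reported 96 MB GGUF file.

```python
# Phase sizes in millions of tokens, as stated in the abstract.
phase_tokens_m = {
    "conversational": 42,
    "cybersecurity": 118,
    "offensive tooling": 10,
}

total_m = sum(phase_tokens_m.values())
shares = {name: tokens / total_m for name, tokens in phase_tokens_m.items()}

# F16 stores each weight in 2 bytes; this is a lower bound on file size.
params = 41.95e6
f16_mb = params * 2 / 1e6  # decimal megabytes

print(f"total corpus: {total_m}M tokens")
for name, share in shares.items():
    print(f"  {name}: {share:.1%}")
print(f"raw F16 weights: {f16_mb:.1f} MB")
```

The phase sizes do sum to the stated 170M tokens, and the cybersecurity phase accounts for a little over two thirds of the corpus.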

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents VectraYX-Nano, a 41.95M-parameter decoder-only Transformer trained from scratch on the 170M-token VectraYX-Sec-ES Spanish cybersecurity corpus assembled at low cost. It uses a three-phase curriculum (conversational 42M, cybersecurity 118M, offensive tooling 10M) with replay, GQA/RoPE/SwiGLU architecture, and SFT with rebalanced tool-use ratios to enable native MCP tool invocation. Empirical results from N=4 seed runs show monotonic loss descent, a loss-versus-register inversion across bootstrap corpora, saturation of B5 at >=25% replay, and that a 1:21 tool-use SFT ratio lifts B4 from 0.000 to 0.230 +/- 0.052 while retaining B5 = 0.725 +/- 0.130; a 260M LoRA replication is also reported.

Significance. If the results hold, the work shows that small from-scratch models with curriculum replay and targeted SFT rebalancing can deliver functional performance on domain-specific tasks in low-resource languages at minimal cost (~$25 corpus compute). Credit is due for the controlled N=4 seed ablations with standard deviations, explicit replay-percentage sweeps, corpus-mix experiments, and reproducible low-compute pipeline details, which provide a concrete template for similar specialized LLM efforts.

major comments (2)

[Abstract and Results] The central empirical claim (that rebalancing the SFT mixture to a 1:21 tool-use ratio raises B4 from a 0.000 floor to 0.230 +/- 0.052 at 42M parameters, treating the floor as a pure corpus-density artifact) rests on B4 (tool-selection) and B5 (conversational gate) being valid proxies for cybersecurity utility. No correlation to external tasks, established cybersecurity benchmarks, or downstream outcomes (e.g., API invocation accuracy or exploit generation) is reported. This assumption is load-bearing for interpreting the v7 configuration and the claim that capacity limits are not at play.
[Results] LoRA replication paragraph: The 260M LoRA model reports B4 = 0.445 +/- 0.201 (N=4), exhibiting substantially larger variance than the main model's +/- 0.052. This high uncertainty weakens any inference about scaling benefits or generalization and requires explicit discussion or additional controls to support the comparison to the 42M headline result.

minor comments (2)

[Abstract] B1 is reported as 0.332 +/- 0.005 without definition; add a brief description of what B1 measures when first introduced in the main text.
[Corpus] The offensive tooling phase (10M tokens) is described at high level but lacks specifics on data sources, collection method, or quality filters used in the eight-VM pipeline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's recognition of our controlled N=4 seed ablations, replay sweeps, and reproducible low-compute pipeline. We address the major comments point by point below.

Referee: [Abstract and Results] The central empirical claim (that rebalancing the SFT mixture to a 1:21 tool-use ratio raises B4 from a 0.000 floor to 0.230 +/- 0.052 at 42M parameters, treating the floor as a pure corpus-density artifact) rests on B4 (tool-selection) and B5 (conversational gate) being valid proxies for cybersecurity utility. No correlation to external tasks, established cybersecurity benchmarks, or downstream outcomes (e.g., API invocation accuracy or exploit generation) is reported. This assumption is load-bearing for interpreting the v7 configuration and the claim that capacity limits are not at play.

Authors: B4 and B5 are internal evaluation metrics defined to measure the precise capabilities targeted by the work: B4 is the accuracy of correct MCP tool selection and formatting on held-out cybersecurity queries, while B5 measures whether the model appropriately gates tool use versus pure conversational responses. In the low-resource Spanish cybersecurity domain, no established external benchmarks exist that incorporate native MCP-style tool invocation. The rebalancing experiment isolates the effect of tool-use density on lifting B4 from its observed floor, supporting the corpus-density interpretation at this scale. We will add an explicit limitations paragraph acknowledging that these remain proxy metrics without direct correlation to downstream outcomes such as real API success rates or exploit generation.

Revision: partial

Referee: [Results] LoRA replication paragraph: The 260M LoRA model reports B4 = 0.445 +/- 0.201 (N=4), exhibiting substantially larger variance than the main model's +/- 0.052. This high uncertainty weakens any inference about scaling benefits or generalization and requires explicit discussion or additional controls to support the comparison to the 42M headline result.

Authors: We agree that the reported standard deviation of +/- 0.201 on the 260M LoRA B4 result is substantially larger than the main model's and reduces the strength of any scaling inference. This elevated variance is likely attributable to the smaller effective sample size for tool-use examples under LoRA adaptation. In revision we will qualify the LoRA paragraph to present the result strictly as a replication check, explicitly highlight the higher uncertainty, and refrain from drawing firm generalization conclusions from the comparison.

Revision: yes
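
The referee's concern about the larger spread can be made concrete with a rough confidence-interval calculation. This is a sketch under assumptions the page does not confirm: the +/- values are sample standard deviations over N=4 seeds, seed scores are roughly normal, and a two-sided 95% Student-t interval with 3 degrees of freedom is appropriate. The critical value 3.182 is the standard table value for that case; the helper name `ci_half_width` is hypothetical.

```python
import math

# Two-sided 95% Student-t critical value for 3 degrees of freedom (N=4 seeds).
T_CRIT_95_DF3 = 3.182

def ci_half_width(std, n, t_crit):
    """Half-width of a t-based confidence interval for the mean."""
    return t_crit * std / math.sqrt(n)

# (mean, standard deviation) for B4 as reported on this page.
results = {
    "42M v7": (0.230, 0.052),
    "260M LoRA": (0.445, 0.201),
}

intervals = {}
for name, (mean, std) in results.items():
    hw = ci_half_width(std, 4, T_CRIT_95_DF3)
    intervals[name] = (mean - hw, mean + hw)
    print(f"{name}: mean {mean:.3f}, 95% CI half-width {hw:.3f}, "
          f"interval [{mean - hw:.3f}, {mean + hw:.3f}]")
```

Under these assumptions the 260M LoRA interval is wide enough to overlap the 42M interval, which is consistent with the authors' decision to present the LoRA result as a replication check and not as evidence of a scaling benefit.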

Circularity Check

0 steps flagged

No significant circularity; empirical results are self-contained

The paper reports direct empirical outcomes from training a 42M-parameter decoder-only model on a custom Spanish cybersecurity corpus using curriculum phases, replay, and SFT mixture rebalancing. Benchmark scores such as B4 = 0.230 +/- 0.052 and B5 = 0.725 +/- 0.130 are measured post-training under controlled ablations (e.g., replay percentages and corpus variants), not quantities that reduce by construction to the input mixture ratios or replay fractions via any equation or self-definition. No mathematical derivations, uniqueness theorems, or ansatzes are invoked; loss curves and benchmark lifts are presented as observed experimental results with explicit N=4 seed statistics. The central claims rest on these independent measurements rather than any load-bearing self-citation chain or fitted-input prediction, rendering the reported findings self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claims rest on empirical training runs whose success depends on several tuned hyperparameters and the representativeness of the assembled corpus; no new physical or mathematical entities are postulated.

free parameters (2)

replay percentage: Swept over {0,5,10,25,50}% and selected at >=25% to saturate B5 performance.
SFT tool-use ratio: Set to 1:21 after observing B4 floor of 0.000 on default mixture.

axioms (1)

domain assumption: The chosen transformer components (GQA, QK-Norm, RMSNorm, SwiGLU, RoPE, z-loss) are appropriate for a 42M decoder on this domain. Adopted without additional justification or ablation against alternatives.

pith-pipeline@v0.9.0 · 6029 in / 1545 out tokens · 54520 ms · 2026-05-22T09:25:55.677944+00:00 · methodology


[1] [1]

Ehsan Aghaei, Xi Niu, Waseem Shadid, and Ehab Al-Shaer. 2022. Secure- BERT: A Domain-Specific Language Model for Cybersecurity.arXiv preprint arXiv:2204.02685(2022)

work page arXiv 2022

[2] [2]

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). 4895– 4901

work page 2023

[3] [3]

Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Cody Blakeney, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Clémentine Lajau- nie, Giada Pistilli, Henri Larcher, Leandro von Werra, and Thomas Wolf. 2025. SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model.arXiv preprint arXiv:2502.02737(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Anthropic. 2024. Introducing the Model Context Protocol. https://www.anthropic. com/news/model-context-protocol. Accessed: 2026-05-08

work page 2024

[5] [5]

AI at Meta. 2024. The Llama 3 Herd of Models. https://ai.meta.com/blog/meta- llama-3/.Meta AI(2024)

work page 2024

[6] [6]

Markus Bayer, Philipp Kuehn, Ramin Shanehsaz, and Christian Reuter. 2024. CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain. ACM Transactions on Privacy and Security27, 2 (2024), 1–20

work page 2024

[7] [7]

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP). 3615–3620

work page 2019

[8] [8]

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum Learning. InProceedings of the 26th Annual International Conference on Machine Learning (ICML). 41–48

work page 2009

[9] [9]

BERTIN Project. 2023. Alpaca-Spanish: Spanish Translation of the Stanford Alpaca Dataset. https://huggingface.co/datasets/bertin-project/alpaca-spanish

work page 2023

[10] [10]

José Cañete, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang, and Jorge Pérez. 2020. Spanish Pre-Trained BERT Model and Evaluation Data. In PML4DC at ICLR 2020

work page 2020

[11] [11]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gau- rav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sut- ton, Sebastian Gehrmann, et al. 2023. Palm 2 technical report.arXiv preprint arXiv:2305.10403(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12]

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). 8440–8451

[13]

Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In Advances in Neural Information Processing Systems, Vol. 36

[14]

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. 2023. Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442 (2023)

[15]

Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. 2024. PentestGPT: An LLM-empowered Automatic Penetration Testing Tool. Proceedings of the 33rd USENIX Security Symposium (2024)

[16]

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314 (2023)

[17]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 4171–4186

[18]

David M. Eberhard, Gary F. Simons, and Charles D. Fennig (eds.). 2023. Ethnologue: Languages of the World. https://www.ethnologue.com. Online resource

[19]

Robert M French. 1999. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3, 4 (1999), 128–135

[20]

Gemma Team and Gemini Team. 2024. Gemma: Open Models Based on Gemini Research and Technology. arXiv preprint arXiv:2403.08295 (2024)

[21]

Georgi Gerganov and llama.cpp contributors. 2023. llama.cpp: LLM Inference in C/C++. https://github.com/ggerganov/llama.cpp

[22]

Georgi Gerganov and the ggml contributors. 2024. GGUF: A Unified Binary Format for Quantized Language Models. https://github.com/ggerganov/ggml/blob/master/docs/gguf.md. Accessed: 2026-05-08

[23]

Dirk Groeneveld, Iz Beltagy, Akshita Tsvigun, Ian Magnusson, Yada Wang, Hannaneh Nam, Dustin Schwenk, Mitchell Wortsman, Sameer Bhagia, Oyvind Anas, et al. 2024. OLMo: Accelerating the Science of Language Models. arXiv preprint arXiv:2402.00838 (2024)

[24]

Kshitij Gupta, Benjamin Thérien, Adam Ibrahim, Mats L. Richter, Quentin Anthony, Eugene Belilovsky, Irina Rish, and Timothée Lesort. 2023. Continual Pre-Training of Large Language Models: How to (re)warm your model? arXiv preprint arXiv:2308.04014 (2023)

[25]

Asier Gutiérrez-Fandiño, Jordi Armengol-Estapé, Marc Pàmies, Joan Llop-Palao, Joaquín Silveira-Ocampo, Casimiro Pio Carrino, Carme Armentano-Oller, Carlos Rodríguez-Penagos, Aitor Gonzalez-Agirre, and Marta Villegas. 2022. MarIA: Spanish Language Models. Procesamiento del Lenguaje Natural 68 (2022), 39–60

[26]

Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, Aitor Gonzalez-Agirre, and Marta Villegas. 2024. Salamandra: A Spanish & Catalan Language Model Family. arXiv preprint arXiv:2402.12693 (2024)

[27]

Alex Henry, Sainbayar Eavani, Kyunghyun Cho, and Orhan Firat. 2020. Query-Key Normalization for Transformer. arXiv preprint arXiv:2010.04559 (2020)

[28]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 (2022)

[29]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685 (2022)

[30]

Rawad Ibrahim, Lucas Caccia, Eugene Belilovsky, and Laurent Charlin. 2024. Simple replay buffer is all you need for sparse-reward continual learning. arXiv preprint arXiv:2402.15795 (2024)

[31]

Andrej Karpathy. 2023. nanoGPT. https://github.com/karpathy/nanoGPT

[32]

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114, 13 (2017), 3521–3526

[33]

Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagy, et al. 2023. Openassistant conversations–democratizing large language model alignment. arXiv preprint arXiv:2304.07327 (2023)

[34]

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP). 66–71

[35]

Pierre Lison and Jörg Tiedemann. 2016. Opensubtitles2016: Extracting large parallel corpora from movie and tv subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). 923–929

[36]

Xiang Liu, Tianyu Zhou, Guojing Tao, Xiao Liu, Ze Liu, Yuchen Cheng, Sheng Zhang, Yiren Zhang, Muse Chen, Zhaozhuo Chen, et al. 2024. MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. arXiv preprint arXiv:2308.03840 (2024)

[37]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 (2019)

[38]

Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning Representations (ICLR)

[39]

National Institute of Standards and Technology. 2024. National Vulnerability Database (NVD). https://nvd.nist.gov. Accessed: 2026-05-08

[40]

Ollama Team. 2023. Ollama. https://ollama.com

[41]

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2024. Gorilla: Large Language Model Connected with Massive APIs. In Advances in Neural Information Processing Systems (NeurIPS)

[42]

Guilherme Penedo, Quentin Malpure, Mohammed Al-Ghosien, Zaid Al-Halah, Adam de Wynter, Shlok Appalaraju, Ragy AlTawy, Sampo Pyysalo, Julien Launay, Yacine Jernite, et al. 2024. The FineWeb dataset. arXiv preprint arXiv:2406.02029 (2024)

[43]

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. In The Twelfth International Conference on Learning Representations (ICLR)

[45]

Qwen Team. 2024. Qwen2.5: A Family of Large Language Models. arXiv preprint arXiv:2405.00856 (2024)

[46]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019)

[47]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36

[48]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36

[49]

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL). 1715–1725

[50]

Noam Shazeer. 2020. Glu variants improve transformer. arXiv preprint arXiv:2002.05202 (2020)

[51]

Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. 2022. Curriculum Learning: A Survey. International Journal of Computer Vision 130, 6 (2022), 1526–1565

[52]

Blake E Strom, Andy Applebaum, Douglas P Miller, Kathryn C Nickels, Adam G Pennington, and Cody B Thomas. 2018. MITRE ATT&CK: Design and Philosophy. In Proceedings of the 2018 ACM Workshop on Learning from Authoritative Security Data. 1–11

[53]

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yun Liu. 2024. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568 (2024), 127063

[54]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

[55]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, Vol. 30

[56]

Wikimedia Foundation. 2001–2024. Wikipedia, the free encyclopedia. https://www.wikipedia.org

[57]

Sang Michael Xie, Hieu Pham, Xinyun Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma, and Adams Wei Yu. 2023. DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36

[58]

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 483–498

[59]

Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. Advances in Neural Information Processing Systems 32 (2019)

[60]

Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. Tinyllama: A small-scale, compute-efficient open-source large language model. arXiv preprint arXiv:2401.02385 (2024)