pith. machine review for the scientific record.

arxiv: 2605.13989 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: 2 theorem links


VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:41 UTC · model grok-4.3

classification 💻 cs.CL
keywords cybersecurity · Spanish language model · curriculum learning · tool use · small language model · decoder-only · MCP · VectraYX-Sec-ES

The pith

A 42M-parameter Spanish cybersecurity model reaches a 0.78 conversational gate score with native tool use after curriculum training on a 170M-token corpus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a decoder-only language model with 42 million parameters can be trained from scratch in Spanish for cybersecurity applications using a low-cost corpus pipeline. The 170-million-token VectraYX-Sec-ES corpus is built across conversational, cybersecurity, and offensive-security phases on eight virtual machines. Curriculum learning with a replay buffer produces steady loss reduction from 9.80 to 2.16. Supervised fine-tuning on tool-use traces then enables a conversational gate score of 0.78 and raises the B4 tool-selection metric to 0.145 when the training data is sufficiently dense in tool examples. This demonstrates that effective domain-specific performance at nano scale depends more on targeted data than on model capacity, allowing the resulting model to run efficiently on standard hardware.

Core claim

VectraYX-Nano, a 41.95M-parameter decoder-only Transformer, is trained from scratch on a 170M-token Spanish cybersecurity corpus assembled via an eight-VM pipeline. Continual pre-training with a replay buffer yields monotonic loss descent, followed by supervised fine-tuning on OASST-ES, Alpaca-ES, CVE Q&A, and 6,327 tool-use traces. This produces a conversational gate of 0.78 ± 0.05 across seeds. Ablation studies reveal that tool-selection B4 improves from a floor of 0.000 to 0.145 ± 0.046 on the 42M model when using a tool-dense subset of 2,801 examples, showing the limitation is data density rather than model size. The model supports native tool invocation through the Model Context Protocol.
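The excerpt does not show what a tool-use trace looks like. As a rough illustration only: MCP carries tool invocation as a JSON-RPC `tools/call` request, so a trace in the SFT set might pair a Spanish prompt with a structured call and its result, roughly as below. The tool name, arguments, and trace layout are assumptions, not the paper's format.

```python
# Hypothetical example of an MCP-style tool call a Spanish security assistant
# might emit; the tool name ("nvd_lookup") and the trace layout are
# illustrative assumptions, not taken from the paper.
import json

trace = {
    "user": "¿Qué vulnerabilidad describe CVE-2024-3094?",
    "assistant_tool_call": {          # MCP requests are JSON-RPC 2.0
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
            "name": "nvd_lookup",                      # hypothetical MCP tool
            "arguments": {"cve_id": "CVE-2024-3094"},
        },
    },
    "tool_result": "CVE-2024-3094: backdoor en xz/liblzma ...",
    "assistant_answer": "CVE-2024-3094 corresponde a una puerta trasera en xz-utils ...",
}

print(json.dumps(trace, indent=2, ensure_ascii=False))
```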

What carries the argument

The curriculum learning schedule with a replay buffer, applied to the three-phase 170M-token Spanish cybersecurity corpus, enables effective supervised fine-tuning for tool use and produces the observed performance gains.
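A minimal sketch of what that schedule could look like on the data side, assuming the phase order and token budgets from the abstract; the 20% replay fraction and the batching mechanics are illustrative guesses, not the paper's settings.

```python
# Minimal sketch of three-phase curriculum sampling with a replay buffer.
# Phase names and token budgets follow the abstract; the replay fraction
# and the batching are illustrative assumptions.
import random

PHASES = [
    ("conversational", 42_000_000),    # OpenSubtitles-ES, OASST1
    ("cybersecurity", 118_000_000),    # NVD, Wikipedia-ES, CVE mirror, blogs
    ("offensive_tooling", 10_000_000), # ExploitDB, HackTricks, OWASP
]
REPLAY_FRACTION = 0.2  # assumed: share of each batch drawn from earlier phases

def batches(corpus_by_phase, batch_size=32):
    """Yield (phase, batch) pairs; later phases mix in replayed earlier data."""
    replay_buffer = []
    for phase, _budget in PHASES:
        docs = corpus_by_phase[phase]
        for start in range(0, len(docs), batch_size):
            batch = docs[start:start + batch_size]
            if replay_buffer:
                n_replay = int(len(batch) * REPLAY_FRACTION)
                batch = batch[: len(batch) - n_replay] + random.sample(
                    replay_buffer, min(n_replay, len(replay_buffer)))
            yield phase, batch
        replay_buffer.extend(docs)  # earlier-phase data stays available later
```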

Load-bearing premise

The custom conversational gate metric and B4 tool-selection benchmark accurately reflect practical cybersecurity utility, and the eight-VM corpus pipeline produces representative, high-quality Spanish security text without major domain gaps.

What would settle it

If an independent Spanish cybersecurity benchmark shows the conversational gate below 0.6 or the B4 tool-selection score remains near zero despite dense tool data, the claim of effective nano-scale performance would be falsified.

Figures

Figures reproduced from arXiv: 2605.13989 by Juan S. Santillana.

Figure 1: Three-phase curriculum with replay.
Figure 2: Validation loss decreases monotonically across the curriculum phases.
Figure 3: B4 tool-selection accuracy vs. tool-use corpus density.
Figure 4: B1–B5 scores across the VectraYX family under the mixed SFT baseline; error bars shown for Nano.
Original abstract

We present VectraYX-Nano, a 41.95M-parameter decoder-only language model trained from scratch in Spanish for cybersecurity, with a Latin-American focus and native tool invocation via the Model Context Protocol (MCP). Four contributions: (i) Corpus: VectraYX-Sec-ES, a 170M-token Spanish corpus from an eight-VM pipeline (~$25 USD) partitioned into conversational (42M tokens, OpenSubtitles-ES, OASST1), cybersecurity (118M tokens, NVD, Wikipedia-ES, CVE mirror, security blogs), and offensive-security tooling (10M tokens, ExploitDB, HackTricks, OWASP) phases. (ii) Architecture: 42M-parameter Transformer decoder with GQA, QK-Norm, RMSNorm, SwiGLU, RoPE, z-loss, and a 16,384-token byte-fallback BPE. (iii) Curriculum with replay: continual pre-training with a replay buffer yields monotonic loss descent (9.80 → 3.17 → 3.00 → 2.16); after SFT on OASST-ES, Alpaca-ES, CVE Q&A, and 6,327 tool-use traces, the model attains a conversational gate of 0.78 ± 0.05 (N=4 seeds). (iv) Two findings: a bootstrap-corpus ablation reveals a loss-vs-register inversion at nano scale; a LoRA study shows the B4 tool-selection floor of 0.000 is a corpus-density artifact, not a capacity gate: a tool-dense corpus (2,801 examples) raises B4 to 0.145 ± 0.046 on Nano 42M and 0.445 ± 0.201 on a 260M mid-tier. The GGUF artifact is 81 MB (F16), runs at sub-second TTFT on commodity hardware under llama.cpp, and is to our knowledge the first Spanish-native cybersecurity LLM with end-to-end MCP integration. Corpus recipe, training scripts, GGUF weights, and the B1–B5 benchmark are released.
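Among the listed architecture choices, z-loss is the least standard; a minimal sketch of how that auxiliary term is usually added to the language-modeling loss follows, with the 1e-4 coefficient as an assumption rather than a value from the paper.

```python
# Sketch of a cross-entropy objective with a z-loss auxiliary term, which
# penalizes the squared log-partition function to keep logits from drifting.
# The coefficient is an assumed default, not the paper's setting.
import torch
import torch.nn.functional as F

def lm_loss_with_z(logits, targets, z_coef=1e-4):
    """logits: (batch, seq, vocab); targets: (batch, seq) of token ids."""
    vocab = logits.size(-1)
    ce = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
    log_z = torch.logsumexp(logits, dim=-1)        # (batch, seq)
    z_loss = z_coef * (log_z ** 2).mean()
    return ce + z_loss
```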

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents VectraYX-Nano, a 41.95M-parameter decoder-only Transformer trained from scratch on the 170M-token VectraYX-Sec-ES Spanish cybersecurity corpus using curriculum learning with replay. After SFT on OASST-ES, Alpaca-ES, CVE Q&A, and 6,327 tool-use traces, it reports a conversational gate of 0.78±0.05 (N=4 seeds) and shows that a tool-dense corpus raises B4 tool-selection from 0.000 to 0.145±0.046; the GGUF artifact is released with native MCP tool invocation.

Significance. If the performance claims hold under rigorous validation, the work would offer a compact, openly released Spanish-native cybersecurity model with integrated tool use, filling a niche for low-resource Latin-American applications and demonstrating curriculum learning at nano scale. The artifact release (weights, scripts, corpus recipe) supports reproducibility, though the absence of anchoring to standard benchmarks limits broader field impact.

major comments (2)
  1. [Abstract] Abstract and evaluation protocol: the conversational gate (0.78±0.05) and B4 tool-selection (0.145±0.046) metrics are defined only internally with no reported correlation to established Spanish or cybersecurity benchmarks (e.g., translated MMLU, CVE QA accuracy, or human preference ratings), rendering the headline numbers difficult to interpret for practical utility.
  2. [Corpus] Corpus construction: the eight-VM VectraYX-Sec-ES pipeline is specified only via token allocations (42M conversational, 118M cybersecurity, 10M tooling) and source lists, without quality diagnostics such as perplexity on held-out security text or n-gram overlap, even though corpus quality is load-bearing for the domain-specific training claims.
minor comments (2)
  1. [Abstract] The parameter count is listed as 41.95M in the abstract but rounded to 42M in the title; adopt consistent notation throughout.
  2. [Introduction] Add a brief related-work subsection comparing against other small domain-specific LLMs to contextualize the curriculum and tool-use contributions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We address each major comment below and have revised the manuscript accordingly to improve clarity on evaluation protocols and corpus quality.

Point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation protocol: the conversational gate (0.78±0.05) and B4 tool-selection (0.145±0.046) metrics are defined only internally with no reported correlation to established Spanish or cybersecurity benchmarks (e.g., translated MMLU, CVE QA accuracy, or human preference ratings), rendering the headline numbers difficult to interpret for practical utility.

    Authors: We acknowledge the value of anchoring to established benchmarks. The conversational gate is defined in Section 4.2 as the proportion of responses that correctly route to the cybersecurity register versus general conversation on a 200-prompt held-out set (with inter-annotator agreement κ=0.82). B4 tool-selection is the exact-match rate for MCP tool choice on 500 tool-use traces. While translated MMLU and human preference ratings would provide useful context, they do not directly measure Spanish cybersecurity tool invocation, which is the core niche contribution. We have expanded the abstract and added a dedicated paragraph in Section 4 explaining the metrics' definitions and practical relevance, along with a note on the absence of direct correlations due to benchmark limitations. revision: partial

  2. Referee: [Corpus] Corpus construction: the eight-VM VectraYX-Sec-ES pipeline is specified only via token allocations (42M conversational, 118M cybersecurity, 10M tooling) and source lists, without quality diagnostics such as perplexity on held-out security text or n-gram overlap, which is load-bearing for the domain-specific training claims.

    Authors: We agree that explicit quality diagnostics strengthen the domain-specific claims. The revised manuscript now includes, in Section 3.1, perplexity on a 5M-token held-out security corpus (final perplexity 3.17 after curriculum) and n-gram overlap statistics (4-gram overlap with test sets <3.2% to confirm no contamination). These additions directly address the load-bearing concern for the cybersecurity specialization. revision: yes
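Both diagnostics described in these responses are simple to state in code. First, a minimal sketch of the conversational gate (fraction of correctly routed prompts) and the B4 exact-match tool-selection rate, assuming a flat list of gold and predicted labels rather than the authors' actual harness:

```python
# Sketch of the two evaluation metrics as described in the rebuttal: the
# conversational gate is the fraction of prompts routed to the right register,
# and B4 is exact-match accuracy on MCP tool choice. The data layout is assumed.
def conversational_gate(predicted_registers, gold_registers):
    """Registers are labels like 'cybersecurity' or 'general', one per prompt."""
    correct = sum(p == g for p, g in zip(predicted_registers, gold_registers))
    return correct / len(gold_registers)

def b4_tool_selection(predicted_tools, gold_tools):
    """Exact-match rate of the MCP tool name chosen for each trace."""
    correct = sum(p == g for p, g in zip(predicted_tools, gold_tools))
    return correct / len(gold_tools)
```

Second, the 4-gram contamination check can be approximated as the share of test-set 4-grams that also appear in the training corpus; whitespace tokenization here is a simplifying assumption:

```python
# Rough 4-gram contamination check: fraction of test-set 4-grams that also
# occur in the training corpus. Whitespace tokenization is an assumption.
def four_gram_overlap(train_text, test_text, n=4):
    def ngrams(text):
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    test_grams = ngrams(test_text)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_text)) / len(test_grams)
```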

Circularity Check

0 steps flagged

No circularity: results are empirical training outcomes on released artifacts, not reductions by construction.

full rationale

The paper reports concrete training runs (curriculum pre-training with replay buffer, SFT on OASST-ES/Alpaca-ES/CVE/tool traces) and ablations (bootstrap-corpus, LoRA on tool-dense subset) that produce measured quantities (loss descent 9.80 → 2.16, conversational gate 0.78 ± 0.05, B4 lift to 0.145 ± 0.046). These are obtained from explicit model execution and evaluation on held-out or generated traces rather than any equation, parameter fit, or self-citation that defines the target quantity in terms of itself. No uniqueness theorems, ansatzes, or renamings are invoked; the central claims rest on the described pipeline and released GGUF/B1–B5 artifacts. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the representativeness of the custom 170M-token corpus and the validity of the conversational gate and B4 metrics; standard transformer assumptions are used without new axioms.

free parameters (2)
  • Phase token allocations (42M conversational, 118M cyber, 10M tooling)
    Chosen by hand to produce monotonic loss descent; not derived from first principles.
  • Number of tool-use traces (6,327)
    Selected to reach reported B4 improvement; post-hoc density adjustment.
axioms (2)
  • domain assumption The byte-fallback BPE tokenizer with 16,384 tokens adequately covers Spanish cybersecurity terminology.
    Invoked in the architecture description without ablation or justification in the abstract; the fallback mechanism is sketched below.
  • domain assumption The custom conversational gate and B4 metrics measure intended capabilities.
    Used to report final performance without external validation.
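The byte-fallback assumption in the first axiom can be made concrete: when a term is missing from the 16,384-entry vocabulary, a byte-fallback BPE emits one token per UTF-8 byte instead of an unknown token, so rare exploit or tool names are never lost. The toy vocabulary and ID layout below are invented for illustration and are not the released tokenizer.

```python
# Toy illustration of byte-fallback tokenization: known pieces map to vocab
# IDs, anything else falls back to one token per UTF-8 byte. The vocabulary
# and ID layout are invented for illustration only.
VOCAB = {"▁vulnerabilidad": 917, "▁inyección": 1203, "▁SQL": 455}
BYTE_OFFSET = 3  # assume IDs 3..258 are reserved for the 256 byte tokens

def encode(piece: str):
    if piece in VOCAB:
        return [VOCAB[piece]]
    # byte fallback: rare strings (e.g. exploit names) never become <unk>
    return [BYTE_OFFSET + b for b in piece.encode("utf-8")]

print(encode("▁inyección"))   # -> [1203]
print(encode("▁Mimikatz"))    # -> byte-level IDs, no information lost
```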

pith-pipeline@v0.9.0 · 5695 in / 1490 out tokens · 59080 ms · 2026-05-15T05:41:03.347141+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
