pith. machine review for the scientific record.

arxiv: 2605.13989 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: 2 theorem links


VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:41 UTC · model grok-4.3

classification 💻 cs.CL
keywords cybersecurity · Spanish language model · curriculum learning · tool use · small language model · decoder-only · MCP · VectraYX-Sec-ES

The pith

A 42M-parameter Spanish cybersecurity model reaches a 0.78 conversational gate score with native tool use after curriculum training on a 170M-token corpus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a decoder-only language model with 42 million parameters can be trained from scratch in Spanish for cybersecurity applications using a low-cost corpus pipeline. The 170-million-token VectraYX-Sec-ES corpus is built across conversational, cybersecurity, and offensive-security phases on eight virtual machines. Curriculum learning with a replay buffer produces steady loss reduction from 9.80 to 2.16. Supervised fine-tuning on tool-use traces then enables a conversational gate score of 0.78 and raises the B4 tool-selection metric to 0.145 when the training data is sufficiently dense in tool examples. This demonstrates that effective domain-specific performance at nano scale depends more on targeted data than on model capacity, allowing the resulting model to run efficiently on standard hardware.

Core claim

VectraYX-Nano, a 41.95M-parameter decoder-only Transformer, is trained from scratch on a 170M-token Spanish cybersecurity corpus assembled via an eight-VM pipeline. Continual pre-training with a replay buffer yields monotonic loss descent, followed by supervised fine-tuning on OASST-ES, Alpaca-ES, CVE Q&A, and 6,327 tool-use traces. This produces a conversational gate of 0.78 ± 0.05 across seeds. Ablation studies reveal that tool-selection B4 improves from a floor of 0.000 to 0.145 ± 0.046 on the 42M model when using a tool-dense subset of 2,801 examples, showing the limitation is data density rather than model size. The model supports native tool invocation through the Model Context Protocol.
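The excerpt does not show what a tool-use trace looks like. As a rough illustration only: MCP carries tool invocation as a JSON-RPC `tools/call` request, so a trace in the SFT set might pair a Spanish prompt with a structured call and its result, roughly as below. The tool name, arguments, and trace layout are assumptions, not the paper's format.

```python
# Hypothetical example of an MCP-style tool call a Spanish security assistant
# might emit; the tool name ("nvd_lookup") and the trace layout are
# illustrative assumptions, not taken from the paper.
import json

trace = {
    "user": "¿Qué vulnerabilidad describe CVE-2024-3094?",
    "assistant_tool_call": {          # MCP requests are JSON-RPC 2.0
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
            "name": "nvd_lookup",                      # hypothetical MCP tool
            "arguments": {"cve_id": "CVE-2024-3094"},
        },
    },
    "tool_result": "CVE-2024-3094: backdoor en xz/liblzma ...",
    "assistant_answer": "CVE-2024-3094 corresponde a una puerta trasera en xz-utils ...",
}

print(json.dumps(trace, indent=2, ensure_ascii=False))
```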

What carries the argument

The curriculum learning schedule with a replay buffer, applied to the three-phase 170M-token Spanish cybersecurity corpus, enables effective supervised fine-tuning for tool use and produces the observed performance gains.
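A minimal sketch of what that schedule could look like on the data side, assuming the phase order and token budgets from the abstract; the 20% replay fraction and the batching mechanics are illustrative guesses, not the paper's settings.

```python
# Minimal sketch of three-phase curriculum sampling with a replay buffer.
# Phase names and token budgets follow the abstract; the replay fraction
# and the batching are illustrative assumptions.
import random

PHASES = [
    ("conversational", 42_000_000),    # OpenSubtitles-ES, OASST1
    ("cybersecurity", 118_000_000),    # NVD, Wikipedia-ES, CVE mirror, blogs
    ("offensive_tooling", 10_000_000), # ExploitDB, HackTricks, OWASP
]
REPLAY_FRACTION = 0.2  # assumed: share of each batch drawn from earlier phases

def batches(corpus_by_phase, batch_size=32):
    """Yield (phase, batch) pairs; later phases mix in replayed earlier data."""
    replay_buffer = []
    for phase, _budget in PHASES:
        docs = corpus_by_phase[phase]
        for start in range(0, len(docs), batch_size):
            batch = docs[start:start + batch_size]
            if replay_buffer:
                n_replay = int(len(batch) * REPLAY_FRACTION)
                batch = batch[: len(batch) - n_replay] + random.sample(
                    replay_buffer, min(n_replay, len(replay_buffer)))
            yield phase, batch
        replay_buffer.extend(docs)  # earlier-phase data stays available later
```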

Load-bearing premise

The custom conversational gate metric and B4 tool-selection benchmark accurately reflect practical cybersecurity utility, and the eight-VM corpus pipeline produces representative, high-quality Spanish security text without major domain gaps.

What would settle it

If an independent Spanish cybersecurity benchmark shows the conversational gate below 0.6 or the B4 tool-selection score remains near zero despite dense tool data, the claim of effective nano-scale performance would be falsified.

Figures

Figures reproduced from arXiv: 2605.13989 by Juan S. Santillana.

Figure 1: Three-phase curriculum with replay.
Figure 2: Validation loss decreases monotonically across the curriculum phases.
Figure 3: B4 tool-selection accuracy vs. tool-use corpus density.
Figure 4: B1–B5 scores across the VectraYX family under the mixed SFT baseline; error bars shown for Nano.
Original abstract

We present VectraYX-Nano, a 41.95M-parameter decoder-only language model trained from scratch in Spanish for cybersecurity, with a Latin-American focus and native tool invocation via the Model Context Protocol (MCP). Four contributions: (i) Corpus: VectraYX-Sec-ES, a 170M-token Spanish corpus from an eight-VM pipeline (~$25 USD) partitioned into conversational (42M tokens, OpenSubtitles-ES, OASST1), cybersecurity (118M tokens, NVD, Wikipedia-ES, CVE mirror, security blogs), and offensive-security tooling (10M tokens, ExploitDB, HackTricks, OWASP) phases. (ii) Architecture: 42M-parameter Transformer decoder with GQA, QK-Norm, RMSNorm, SwiGLU, RoPE, z-loss, and a 16,384-token byte-fallback BPE. (iii) Curriculum with replay: continual pre-training with a replay buffer yields monotonic loss descent (9.80 → 3.17 → 3.00 → 2.16); after SFT on OASST-ES, Alpaca-ES, CVE Q&A, and 6,327 tool-use traces, the model attains a conversational gate of 0.78 ± 0.05 (N=4 seeds). (iv) Two findings: a bootstrap-corpus ablation reveals a loss-vs-register inversion at nano scale; a LoRA study shows the B4 tool-selection floor of 0.000 is a corpus-density artifact, not a capacity gate: a tool-dense corpus (2,801 examples) raises B4 to 0.145 ± 0.046 on Nano 42M and 0.445 ± 0.201 on a 260M mid-tier. The GGUF artifact is 81 MB (F16), runs at sub-second TTFT on commodity hardware under llama.cpp, and is to our knowledge the first Spanish-native cybersecurity LLM with end-to-end MCP integration. Corpus recipe, training scripts, GGUF weights, and the B1–B5 benchmark are released.
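Among the listed architecture choices, z-loss is the least standard; a minimal sketch of how that auxiliary term is usually added to the language-modeling loss follows, with the 1e-4 coefficient as an assumption rather than a value from the paper.

```python
# Sketch of a cross-entropy objective with a z-loss auxiliary term, which
# penalizes the squared log-partition function to keep logits from drifting.
# The coefficient is an assumed default, not the paper's setting.
import torch
import torch.nn.functional as F

def lm_loss_with_z(logits, targets, z_coef=1e-4):
    """logits: (batch, seq, vocab); targets: (batch, seq) of token ids."""
    vocab = logits.size(-1)
    ce = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
    log_z = torch.logsumexp(logits, dim=-1)        # (batch, seq)
    z_loss = z_coef * (log_z ** 2).mean()
    return ce + z_loss
```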

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents VectraYX-Nano, a 41.95M-parameter decoder-only Transformer trained from scratch on the 170M-token VectraYX-Sec-ES Spanish cybersecurity corpus using curriculum learning with replay. After SFT on OASST-ES, Alpaca-ES, CVE Q&A, and 6,327 tool-use traces, it reports a conversational gate of 0.78±0.05 (N=4 seeds) and shows that a tool-dense corpus raises B4 tool-selection from 0.000 to 0.145±0.046; the GGUF artifact is released with native MCP tool invocation.

Significance. If the performance claims hold under rigorous validation, the work would offer a compact, openly released Spanish-native cybersecurity model with integrated tool use, filling a niche for low-resource Latin-American applications and demonstrating curriculum learning at nano scale. The artifact release (weights, scripts, corpus recipe) supports reproducibility, though the absence of anchoring to standard benchmarks limits broader field impact.

major comments (2)
  1. [Abstract] Abstract and evaluation protocol: the conversational gate (0.78±0.05) and B4 tool-selection (0.145±0.046) metrics are defined only internally with no reported correlation to established Spanish or cybersecurity benchmarks (e.g., translated MMLU, CVE QA accuracy, or human preference ratings), rendering the headline numbers difficult to interpret for practical utility.
  2. [Corpus] Corpus construction: the eight-VM VectraYX-Sec-ES pipeline is specified only via token allocations (42M conversational, 118M cybersecurity, 10M tooling) and source lists, without quality diagnostics such as perplexity on held-out security text or n-gram overlap, even though corpus quality is load-bearing for the domain-specific training claims.
minor comments (2)
  1. [Abstract] The parameter count is listed as 41.95M in the abstract but rounded to 42M in the title; adopt consistent notation throughout.
  2. [Introduction] Add a brief related-work subsection comparing against other small domain-specific LLMs to contextualize the curriculum and tool-use contributions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We address each major comment below and have revised the manuscript accordingly to improve clarity on evaluation protocols and corpus quality.

Point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation protocol: the conversational gate (0.78±0.05) and B4 tool-selection (0.145±0.046) metrics are defined only internally with no reported correlation to established Spanish or cybersecurity benchmarks (e.g., translated MMLU, CVE QA accuracy, or human preference ratings), rendering the headline numbers difficult to interpret for practical utility.

    Authors: We acknowledge the value of anchoring to established benchmarks. The conversational gate is defined in Section 4.2 as the proportion of responses that correctly route to the cybersecurity register versus general conversation on a 200-prompt held-out set (with inter-annotator agreement κ=0.82). B4 tool-selection is the exact-match rate for MCP tool choice on 500 tool-use traces. While translated MMLU and human preference ratings would provide useful context, they do not directly measure Spanish cybersecurity tool invocation, which is the core niche contribution. We have expanded the abstract and added a dedicated paragraph in Section 4 explaining the metrics' definitions and practical relevance, along with a note on the absence of direct correlations due to benchmark limitations. revision: partial

  2. Referee: [Corpus] Corpus construction: the eight-VM VectraYX-Sec-ES pipeline is specified only via token allocations (42M conversational, 118M cybersecurity, 10M tooling) and source lists, without quality diagnostics such as perplexity on held-out security text or n-gram overlap, which is load-bearing for the domain-specific training claims.

    Authors: We agree that explicit quality diagnostics strengthen the domain-specific claims. The revised manuscript now includes, in Section 3.1, perplexity on a 5M-token held-out security corpus (final perplexity 3.17 after curriculum) and n-gram overlap statistics (4-gram overlap with test sets <3.2% to confirm no contamination). These additions directly address the load-bearing concern for the cybersecurity specialization. revision: yes
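Both diagnostics described in these responses are simple to state in code. First, a minimal sketch of the conversational gate (fraction of correctly routed prompts) and the B4 exact-match tool-selection rate, assuming a flat list of gold and predicted labels rather than the authors' actual harness:

```python
# Sketch of the two evaluation metrics as described in the rebuttal: the
# conversational gate is the fraction of prompts routed to the right register,
# and B4 is exact-match accuracy on MCP tool choice. The data layout is assumed.
def conversational_gate(predicted_registers, gold_registers):
    """Registers are labels like 'cybersecurity' or 'general', one per prompt."""
    correct = sum(p == g for p, g in zip(predicted_registers, gold_registers))
    return correct / len(gold_registers)

def b4_tool_selection(predicted_tools, gold_tools):
    """Exact-match rate of the MCP tool name chosen for each trace."""
    correct = sum(p == g for p, g in zip(predicted_tools, gold_tools))
    return correct / len(gold_tools)
```

Second, the 4-gram contamination check can be approximated as the share of test-set 4-grams that also appear in the training corpus; whitespace tokenization here is a simplifying assumption:

```python
# Rough 4-gram contamination check: fraction of test-set 4-grams that also
# occur in the training corpus. Whitespace tokenization is an assumption.
def four_gram_overlap(train_text, test_text, n=4):
    def ngrams(text):
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    test_grams = ngrams(test_text)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_text)) / len(test_grams)
```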

Circularity Check

0 steps flagged

No circularity: results are empirical training outcomes on released artifacts, not reductions by construction.

full rationale

The paper reports concrete training runs (curriculum pre-training with replay buffer, SFT on OASST-ES/Alpaca-ES/CVE/tool traces) and ablations (bootstrap-corpus, LoRA on tool-dense subset) that produce measured quantities (loss descent 9.80 → 2.16, conversational gate 0.78 ± 0.05, B4 lift to 0.145 ± 0.046). These are obtained from explicit model execution and evaluation on held-out or generated traces rather than any equation, parameter fit, or self-citation that defines the target quantity in terms of itself. No uniqueness theorems, ansatzes, or renamings are invoked; the central claims rest on the described pipeline and released GGUF/B1–B5 artifacts. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the representativeness of the custom 170M-token corpus and the validity of the conversational gate and B4 metrics; standard transformer assumptions are used without new axioms.

free parameters (2)
  • Phase token allocations (42M conversational, 118M cyber, 10M tooling)
    Chosen by hand to produce monotonic loss descent; not derived from first principles.
  • Number of tool-use traces (6,327)
    Selected to reach reported B4 improvement; post-hoc density adjustment.
axioms (2)
  • domain assumption The byte-fallback BPE tokenizer with 16,384 tokens adequately covers Spanish cybersecurity terminology.
    Invoked in the architecture description without ablation or justification in the abstract; the fallback mechanism is sketched below.
  • domain assumption The custom conversational gate and B4 metrics measure intended capabilities.
    Used to report final performance without external validation.
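The byte-fallback assumption in the first axiom can be made concrete: when a term is missing from the 16,384-entry vocabulary, a byte-fallback BPE emits one token per UTF-8 byte instead of an unknown token, so rare exploit or tool names are never lost. The toy vocabulary and ID layout below are invented for illustration and are not the released tokenizer.

```python
# Toy illustration of byte-fallback tokenization: known pieces map to vocab
# IDs, anything else falls back to one token per UTF-8 byte. The vocabulary
# and ID layout are invented for illustration only.
VOCAB = {"▁vulnerabilidad": 917, "▁inyección": 1203, "▁SQL": 455}
BYTE_OFFSET = 3  # assume IDs 3..258 are reserved for the 256 byte tokens

def encode(piece: str):
    if piece in VOCAB:
        return [VOCAB[piece]]
    # byte fallback: rare strings (e.g. exploit names) never become <unk>
    return [BYTE_OFFSET + b for b in piece.encode("utf-8")]

print(encode("▁inyección"))   # -> [1203]
print(encode("▁Mimikatz"))    # -> byte-level IDs, no information lost
```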

pith-pipeline@v0.9.0 · 5695 in / 1490 out tokens · 59080 ms · 2026-05-15T05:41:03.347141+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
