pith. machine review for the scientific record.

arxiv: 2604.26217 · v1 · submitted 2026-04-29 · 💻 cs.CR

Recognition: unknown

OpenSOC-AI: Democratizing Security Operations with Parameter Efficient LLM Log Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 13:25 UTC · model grok-4.3

classification 💻 cs.CR
keywords LLM · security logs · LoRA · threat classification · parameter-efficient fine-tuning · SOC · log analysis · TinyLlama

The pith

Parameter-efficient fine-tuning of a small language model lifts threat classification on security logs from near-zero to 68% accuracy using only 450 training examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates that a 1.1-billion-parameter language model can be adapted with low-rank updates to analyze raw security logs for threats, techniques, and severity levels. The approach trains in minutes on modest hardware and yields substantial accuracy improvements over the base model on held-out test cases. If the results generalize, the approach opens automated security operations to organizations that cannot maintain dedicated teams or expensive platforms. The work releases all code and data to encourage further development by the community.
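
To make the task concrete, here is a hypothetical input/output pairing of the kind such a system targets. The paper's exact prompt and label schema are not reproduced in this review, so the field names and example values below are illustrative assumptions (T1110 is the genuine ATT&CK ID for Brute Force).

```python
# Hypothetical illustration of the three-label task: one raw log line mapped to
# a threat class, a MITRE ATT&CK technique, and a severity. Field names and the
# example log are assumptions, not the paper's schema.
example = {
    "log": "Failed password for invalid user admin from 203.0.113.7 port 52211 ssh2",
    "threat_class": "brute_force",  # threat classification
    "technique": "T1110",           # MITRE ATT&CK: Brute Force
    "severity": "medium",           # severity assessment
}
```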

Core claim

The central discovery is that fine-tuning TinyLlama-1.1B with LoRA on 450 SOC-specific examples produces a model that classifies threats at 68% accuracy, assesses severity at 58% accuracy, and achieves an F1 score of 0.68 on a 50-example held-out set, compared to near-zero performance from the untuned model.

What carries the argument

LoRA fine-tuning of the TinyLlama-1.1B model to process raw security log entries for automated classification tasks.
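
As a reference point, below is a minimal sketch of this kind of LoRA setup using the HuggingFace PEFT library. The paper does not state its rank, scaling, or target modules, so the values here are assumptions, chosen because they land near the reported 12.6 million trainable parameters.

```python
# Minimal sketch of LoRA adaptation of TinyLlama-1.1B with HuggingFace PEFT.
# Rank, alpha, dropout, and target modules are assumptions; r=16 over all
# attention and MLP projections yields roughly the 12.6M trainable parameters
# (~1.13% of the base model) reported in the abstract.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # base model named in the paper's appendix
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16,           # assumed rank
    lora_alpha=32,  # assumed scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # expect roughly 12.6M trainable of 1.1B total
```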

If this is right

  • Small businesses gain access to automated log analysis without large infrastructure investments.
  • The system can map detected threats to MITRE ATT&CK techniques directly from logs.
  • Fine-tuning requires minimal resources, completing in under five minutes on a single GPU.
  • Public availability of the adapter weights supports easy deployment and customization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar parameter-efficient methods could apply to log analysis in other technical domains like IT operations or scientific data.
  • The results hint that domain adaptation with small datasets can unlock practical LLM use cases in specialized fields.
  • Future work might explore combining this with rule-based systems for higher reliability in production (a hybrid sketch follows this list).
  • Scaling the number of examples or testing on diverse log formats would strengthen evidence of broad applicability.
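
A minimal sketch of the hybrid idea in the third bullet, assuming a couple of illustrative regex rules and any callable classifier standing in for the fine-tuned model (all patterns and names here are invented for illustration):

```python
# Hypothetical rule-first pipeline: cheap, auditable rules handle known patterns;
# anything unmatched falls through to the fine-tuned model.
import re

RULES = [
    (re.compile(r"Failed password.*ssh", re.I), ("brute_force", "high")),
    (re.compile(r"UNION.*SELECT", re.I), ("sql_injection", "high")),
]

def classify(log_line, llm_classify):
    for pattern, verdict in RULES:
        if pattern.search(log_line):
            return verdict           # deterministic path, easy to audit
    return llm_classify(log_line)    # model fallback for novel or messy logs

# Usage with a stub in place of the real model:
print(classify("Failed password for root from 198.51.100.2 port 40022 ssh2",
               llm_classify=lambda line: ("unknown", "low")))
```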

Load-bearing premise

The small collection of 450 examples captures enough of the patterns in actual security logs to allow the model to generalize to new, unseen logs without overfitting.

What would settle it

Running the fine-tuned model on a fresh collection of security logs from an independent source and observing whether accuracy remains near 68% or falls back toward baseline levels.
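
A minimal sketch of that check, assuming the model's raw outputs have already been parsed into threat-class labels (the label names and values below are illustrative placeholders):

```python
# Re-evaluating the tuned model on logs from an independent source.
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

gold = ["brute_force", "sql_injection", "benign", "port_scan"]    # independent labels
pred = ["brute_force", "sql_injection", "benign", "brute_force"]  # parsed model output

print(accuracy_score(gold, pred))                # stays near 0.68, or falls to baseline?
print(f1_score(gold, pred, average="weighted"))  # comparable to the reported F1 of 0.68
print(confusion_matrix(gold, pred))              # where the residual errors concentrate
```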

Figures

Figures reproduced from arXiv: 2604.26217 by Chaitanya Vilas Garware, Sharif Noor Zisad.

Figure 1. OpenSOC-AI system architecture. Raw log entries are formatted into … (caption truncated; view at source ↗)
read the original abstract

Small and medium-sized businesses (SMBs) face an escalating cybersecurity threat landscape, yet most lack the resources to staff full Security Operations Centers (SOCs) or deploy enterprise-grade detection platforms. This paper presents OpenSOC-AI, a lightweight log analysis framework that uses parameter-efficient fine-tuning of a 1.1-billion-parameter language model (TinyLlama-1.1B) to perform automated threat classification, MITRE ATT&CK technique mapping, and severity assessment on raw security log entries. Using Low-Rank Adaptation (LoRA) with only 12.6 million trainable parameters (roughly 1.13% of the base model), we fine-tuned on 450 domain-specific SOC examples in under five minutes on a single NVIDIA T4 GPU. Testing on a held-out set of 50 examples showed a 68-percentage-point gain in threat classification accuracy (from 0% to 68%), a 30-percentage-point gain in severity accuracy (from 28% to 58%), and an F1 score of 0.68 compared to the untuned baseline. The full codebase, adapter weights, and datasets are publicly released to support reproducibility and community extension.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces OpenSOC-AI, a lightweight framework that applies Low-Rank Adaptation (LoRA) to fine-tune the 1.1B-parameter TinyLlama model on 450 domain-specific SOC log examples. The system performs automated threat classification, MITRE ATT&CK technique mapping, and severity assessment directly on raw security logs. It reports a 68 percentage-point gain in threat classification accuracy (0% to 68%), a 30 percentage-point gain in severity accuracy (28% to 58%), and an F1 score of 0.68 on a held-out set of 50 examples relative to the untuned baseline, with full code, adapter weights, and datasets released publicly.

Significance. If the reported gains hold under more rigorous validation, the work could meaningfully lower the barrier for small and medium businesses to implement basic automated log analysis without enterprise-scale resources. The emphasis on parameter-efficient tuning (only 1.13% trainable parameters) and the complete public release of code, weights, and data are clear strengths that support reproducibility and community follow-up.

major comments (2)
  1. [Abstract and results section] The central performance claims (68 pp threat accuracy gain, 30 pp severity gain, F1=0.68) are derived from a single fixed 450/50 train/test split with no reported details on split stratification, log source diversity, template overlap between sets, or cross-validation. This single-split design leaves open the possibility that gains reflect memorization of surface patterns or output formatting rather than robust generalization, especially given the untuned baseline of 0% threat accuracy.
  2. [Abstract and experimental setup] No information is provided on baseline prompt construction, exact input formatting for the untuned model, data sourcing and labeling process, potential label noise, or statistical significance of the accuracy differences. These omissions are load-bearing because they prevent assessment of whether the fine-tuning teaches genuine threat reasoning or merely teaches the model to produce the expected output schema.
minor comments (2)
  1. [Methodology] The manuscript would benefit from an explicit description of the LoRA hyperparameters (rank, alpha, target modules) and training hyperparameters (learning rate, epochs, batch size) in a dedicated table or subsection for reproducibility.
  2. [Results] A figure or table presenting per-task confusion matrices or error analysis on the held-out set would help readers understand the nature of the remaining errors after fine-tuning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify limitations in our current evaluation that we will address through revisions to strengthen the work.

read point-by-point responses
  1. Referee: [Abstract and results section] The central performance claims (68 pp threat accuracy gain, 30 pp severity gain, F1=0.68) are derived from a single fixed 450/50 train/test split with no reported details on split stratification, log source diversity, template overlap between sets, or cross-validation. This single-split design leaves open the possibility that gains reflect memorization of surface patterns or output formatting rather than robust generalization, especially given the untuned baseline of 0% threat accuracy.

    Authors: We agree that reliance on a single random split is a limitation given the modest dataset size. The 450/50 partition was performed randomly without stratification to retain coverage of infrequent threat classes. In the revised manuscript we will add an explicit description of the split procedure, report results from 5-fold cross-validation, quantify log source diversity across the collected examples, and check for template overlap between the training and test sets. The untuned baseline achieving 0% threat classification accuracy provides evidence that the fine-tuned model is not simply learning output formatting, as the baseline failed to generate any valid threat labels. revision: yes

  2. Referee: [Abstract and experimental setup] No information is provided on baseline prompt construction, exact input formatting for the untuned model, data sourcing and labeling process, potential label noise, or statistical significance of the accuracy differences. These omissions are load-bearing because they prevent assessment of whether the fine-tuning teaches genuine threat reasoning or merely teaches the model to produce the expected output schema.

    Authors: We will expand the experimental setup section to include the exact prompt templates and input formatting used for both the baseline and LoRA-tuned models. Data were sourced from publicly available security log collections and labeled by the authors following standard SOC practices and the MITRE ATT&CK framework; we will describe this process and note potential label noise as a limitation. We will also report statistical significance tests (e.g., McNemar’s test) for the observed accuracy differences in the revision. revision: yes
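
For reference, a sketch of the two promised analyses: stratified 5-fold cross-validation over the pooled 500 examples, and an exact McNemar's test on paired per-example correctness. Everything here is a synthetic placeholder; the contingency counts merely mirror the reported 0/50 baseline versus 34/50 tuned threat accuracy.

```python
# Sketch of the robustness checks promised in the rebuttal; data are synthetic.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=500)  # stand-in threat-class labels, 5 classes
examples = np.arange(500)              # indices standing in for log entries

# Stratified folds preserve class proportions in every train/test split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(examples, labels)):
    # fine-tune on examples[train_idx], evaluate on examples[test_idx]
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test")

# Exact McNemar's test on paired correctness over one 50-example test set.
# Rows: baseline correct/incorrect; columns: tuned correct/incorrect.
table = [[0, 0],
         [34, 16]]
print(mcnemar(table, exact=True).pvalue)  # tiny p-value: gain unlikely to be chance
```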

Circularity Check

0 steps flagged

No circularity: empirical fine-tuning and held-out evaluation

full rationale

The paper reports results from LoRA fine-tuning of TinyLlama-1.1B on 450 SOC examples followed by direct accuracy measurement on a separate 50-example held-out set. No derivation, first-principles argument, or equation chain is presented; the performance numbers (68 pp threat accuracy gain, 30 pp severity gain, F1=0.68) are obtained by running the trained model on the test split and comparing to the untuned baseline. This is standard supervised learning evaluation with no self-definitional reduction, fitted-input-as-prediction, or self-citation load-bearing step. The central claims rest on external data splits rather than any internal construction that forces the outcome.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claim rests on the assumption that the small curated dataset captures sufficient domain patterns for generalization and that LoRA hyperparameters chosen for this run produce stable improvements.

free parameters (1)
  • LoRA rank and scaling
    Selected to reach 12.6 million trainable parameters; the specific values are not stated in the abstract but directly control the reported efficiency and performance (a back-of-envelope consistency check follows this ledger).
axioms (1)
  • domain assumption: 450 labeled SOC examples are representative and sufficient for effective fine-tuning on threat classification and severity tasks
    Invoked implicitly by the decision to train on this set and report gains on held-out data.
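
As a consistency check on the 12.6-million figure above, the back-of-envelope below recovers it at rank 16, assuming TinyLlama-1.1B's published dimensions (22 layers, hidden size 2048, MLP intermediate size 5632, grouped-query attention with a 256-dimensional KV projection) and LoRA applied to all seven attention and MLP projections. The rank and target-module choices are assumptions, not the paper's stated configuration.

```python
# LoRA adds r * (d_in + d_out) parameters per adapted weight matrix.
# Model dimensions are TinyLlama-1.1B's published config; rank and the set of
# adapted matrices are assumed.
r = 16
layers, hidden, inter, kv = 22, 2048, 5632, 256  # kv = 4 KV heads * head_dim 64

per_layer = (
    r * (hidden + hidden)    # q_proj
    + r * (hidden + kv)      # k_proj
    + r * (hidden + kv)      # v_proj
    + r * (hidden + hidden)  # o_proj
    + r * (hidden + inter)   # gate_proj
    + r * (hidden + inter)   # up_proj
    + r * (inter + hidden)   # down_proj
)
print(layers * per_layer)  # 12,615,680 -- close to the reported 12.6M
```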

pith-pipeline@v0.9.0 · 5511 in / 1355 out tokens · 59225 ms · 2026-05-07T13:25:06.338221+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When the Ruler is Broken: Parsing-Induced Suppression in LLM-Based Security Log Evaluation

    cs.CR 2026-05 conditional novelty 7.0

    Strict regex parsing of LLM security log outputs introduces systematic errors that can make functional models appear non-functional, with a 76-point accuracy gap recovered by fuzzy parsing.
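
As a toy illustration of that failure mode (the patterns and output string below are invented, not from either paper): a strict schema match can score a substantively correct answer as wrong, while a tolerant match recovers it.

```python
# Strict vs. fuzzy parsing of a hypothetical model output.
import re

output = "Threat: Brute Force attack detected (T1110)."

strict = re.fullmatch(r"brute_force", output)          # exact-schema match: fails
fuzzy = re.search(r"brute[\s_-]?force", output, re.I)  # tolerant match: succeeds

print(bool(strict), bool(fuzzy))  # False True -- same model, two very different scores
```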

Reference graph

Works this paper leans on

10 extracted references · 6 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    2024 Data Breach Investigations Report,

Verizon, “2024 Data Breach Investigations Report,” Verizon Enterprise Solutions, 2024. [Online]. Available: https://www.verizon.com/business/resources/reports/dbir/

  2. [2]

    Cost of a Data Breach Report 2023,

    Ponemon Institute, “Cost of a Data Breach Report 2023,” IBM Security, 2023. [Online]. Available: https://www.ibm.com/reports/data-breach


  4. [4]

    LoRA: Low-Rank Adaptation of Large Language Models

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, and W. Chen, “LoRA: Low-Rank Adaptation of Large Language Models,” in Proc. ICLR, 2022. [Online]. Available: https://arxiv.org/abs/2106.09685

  5. [5]

    QLoRA: Efficient Finetuning of Quantized LLMs

T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient Finetuning of Quantized LLMs,” in Proc. NeurIPS, 2023. [Online]. Available: https://arxiv.org/abs/2305.14314

  6. [6]

A Survey on Data Augmentation for Text Classification,

M. Bayer, M. A. Kaufhold, and C. Reuter, “A Survey on Data Augmentation for Text Classification,” ACM Computing Surveys, 2022. [Online]. Available: https://arxiv.org/abs/2107.03158

  7. [7]

    Revolutionizing Cyber Threat Detection with Large Language Models,

M. A. Ferrag, M. Ndhlovu, N. Tihanyi, L. C. Magalhães, M. Debbah, and T. Lestable, “Revolutionizing Cyber Threat Detection with Large Language Models,” IEEE Access, 2023. [Online]. Available: https://arxiv.org/abs/2306.14263

  8. [8]

    TinyLlama: An Open-Source Small Language Model

P. Zhang, G. Zeng, T. Wang, and W. Lu, “TinyLlama: An Open-Source Small Language Model,” arXiv:2401.02385, 2024. [Online]. Available: https://arxiv.org/abs/2401.02385

  9. [9]

    ATT&CK Framework v14,

    MITRE Corporation, “ATT&CK Framework v14,” 2024. [Online]. Available: https://attack.mitre.org/

  10. [10]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

T. Wolf et al., “Transformers: State-of-the-Art Natural Language Processing,” in Proc. EMNLP 2020 (System Demonstrations), 2020. [Online]. Available: https://arxiv.org/abs/1910.03771

Appendix

Table IV. Full training hyperparameter configuration (remaining rows truncated at source)

  • Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • Quantization: 4-bit NF4, double quantization, fp16 compute
  • …