Pith · machine review for the scientific record

arxiv: 2605.07293 · v1 · submitted 2026-05-08 · 💻 cs.CR

Recognition: 2 Lean theorem links

When the Ruler is Broken: Parsing-Induced Suppression in LLM-Based Security Log Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:34 UTC · model grok-4.3

classification 💻 cs.CR
keywords LLM evaluation · security log classification · parsing errors · threat detection · SOC benchmarking · regex extraction · fuzzy parsing · evaluation methodology

The pith

Strict regex parsers can suppress reported LLM threat accuracy to 0% on outputs that a fuzzy parser scores at 76%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that common evaluation pipelines relying on strict regular-expression extraction from free-form LLM outputs introduce systematic errors in security log threat classification. In the OpenSOC-AI case study (a LoRA-tuned TinyLlama-1.1B system), this practice produced 0% threat accuracy while the identical outputs reached 76% under a fuzzy parser, with severity accuracy holding steady at 58% as a control. The 76-point gap arises entirely from parser mismatch rather than model failure. The authors introduce SOC-Bench v0, a framework with a fixed 13-category taxonomy, statistical power requirements, fuzzy extraction rules, and a public scorer to prevent similar distortions in future evaluations.

Core claim

A strict regex parser applied to the free-form outputs of a LoRA-tuned TinyLlama model for security log threat classification produced 0% threat accuracy, whereas a corrected fuzzy parser on the same outputs and evaluation set recovered 76% threat accuracy; severity accuracy remained constant at 58% under both parsers. This isolates field-name format mismatch as the sole cause. Residual errors concentrated in reconnaissance, brute force, and credential stuffing. The work proposes SOC-Bench v0 with a standardized 13-category threat taxonomy, minimum statistical power requirements, fuzzy field extraction specification, and public scoring script.

What carries the argument

Parsing-induced suppression: the mechanism by which strict regex field extraction discards semantically correct LLM outputs due to minor format variations, understating true model performance.
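The mechanism can be sketched in a few lines. The field labels and patterns below are hypothetical, since the paper does not publish its actual parser rules; the point is only that a label variant invisible to a strict anchored regex is recoverable by a looser pattern applied to the same output.

```python
import re

# Hypothetical model output: semantically correct, but labeled
# "Threat Category" rather than the exact "threat:" the strict parser expects.
output = "Threat Category: Brute Force\nSeverity: high"

# Strict parser: anchored regex demanding the literal field name "threat:".
strict = re.search(r"^threat:\s*(.+)$", output, re.IGNORECASE | re.MULTILINE)

# Fuzzy parser: tolerate common label variants and normalize the value.
fuzzy = re.search(r"threat(?:\s+(?:category|type))?\s*[:=]\s*(.+)",
                  output, re.IGNORECASE)

# strict is None, so the strict pipeline scores this output as wrong;
# fuzzy recovers the value, so the same output is scored correctly.
```

Under this failure mode the model's answer never changes; only the extraction step decides whether it counts.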

If this is right

  • Threat accuracy can drop to zero solely from parser strictness while the underlying model remains functional.
  • Severity accuracy provides a stable control that stays unchanged across parsers.
  • Misclassifications cluster in behaviorally adjacent categories such as reconnaissance, brute force, and credential stuffing.
  • Standardizing fuzzy extraction plus a fixed taxonomy removes parser-specific distortion from reported results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Many existing LLM-based security evaluations may have understated model performance for the same parser-related reason.
  • The proposed benchmark could enable fairer head-to-head comparisons of different models on security tasks.
  • Models appear more robust at severity assessment than at fine-grained threat typing when output formats are allowed to vary.

Load-bearing premise

That the fuzzy parser correctly recovers the model's intended classifications without introducing its own systematic bias or over-accepting incorrect outputs.

What would settle it

Independent human re-labeling of the 50 model outputs to test whether the fuzzy parser's 76% threat accuracy matches human judgment on the actual classifications.
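Such a relabeling study reduces to an agreement computation. The sketch below uses invented labels purely to illustrate the arithmetic: raw agreement between fuzzy-parser labels and independent human labels, plus Cohen's kappa to discount chance agreement.

```python
from collections import Counter

# Hypothetical labels for 50 outputs; the real study would substitute the
# fuzzy parser's extractions and fresh human annotations.
parser = ["brute force"] * 20 + ["recon"] * 15 + ["phishing"] * 15
human = ["brute force"] * 18 + ["recon"] * 2 + ["recon"] * 15 + ["phishing"] * 15

n = len(parser)
agree = sum(p == h for p, h in zip(parser, human)) / n  # raw agreement

# Cohen's kappa: correct raw agreement for agreement expected by chance
# given each rater's marginal label distribution.
cp, ch = Counter(parser), Counter(human)
pc = sum((cp[c] / n) * (ch[c] / n) for c in set(parser) | set(human))
kappa = (agree - pc) / (1 - pc)
```

High agreement and kappa would confirm the fuzzy parser tracks human judgment; low kappa despite decent raw agreement would signal the over-acceptance worry raised under the load-bearing premise.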

Figures

Figures reproduced from arXiv: 2605.07293 by Chaitanya Vilas Garware, Sharif Noor Zisad.

Figure 1. Strict vs. fuzzy parser comparison on TinyLlama-1.1B + LoRA.
Figure 2. Cross-model comparison under identical fuzzy evaluation protocol.
Figure 3. Failure concentration under fuzzy evaluation: all 12 residual errors fall in three behaviorally adjacent categories.
Original abstract

LLM-based SOC log classifiers are commonly evaluated using regular-expression pipelines that extract structured fields from free-form model output. We demonstrate that this practice introduces a class of silent, systematic evaluation errors, which we term parsing-induced suppression, that can cause a fully functional model to appear completely non-functional. Using OpenSOC-AI, a LoRA fine-tuned TinyLlama-1.1B system for security log threat classification, as a reproducible case study, we show that a strict regex parser reported 0% threat accuracy while a corrected fuzzy parser recovered 76% threat accuracy on the same model outputs and the same evaluation set, a gap of 76 percentage points attributable entirely to evaluation methodology. Severity accuracy remained constant at 58% under both parsers, providing a built-in control that isolates field-name format mismatch as the causal mechanism rather than model degradation. For external reference, Claude Sonnet, evaluated zero-shot on the same 50-example set, achieved 88% threat accuracy and 58% severity accuracy under the same fuzzy protocol. Residual errors under fuzzy evaluation concentrate in three categories (reconnaissance, brute force, and credential stuffing), each contributing 4 of the 12 misclassifications, a pattern that reflects class-boundary difficulty among behaviorally adjacent log types rather than global model failure. We propose SOC-Bench v0, a benchmark framework comprising a standardized 13-category threat taxonomy, minimum statistical power requirements, a fuzzy field extraction specification, and a public scoring script, intended to prevent parser-specific accuracy distortion in future SOC LLM research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that regex-based parsing pipelines for extracting structured fields from LLM outputs in security log classification introduce 'parsing-induced suppression,' a systematic error that can make functional models appear non-functional. Using OpenSOC-AI (a LoRA fine-tuned TinyLlama-1.1B) as a case study on a 50-example set, a strict regex parser yields 0% threat accuracy while a fuzzy parser recovers 76% on identical outputs, with severity accuracy constant at 58% as a control isolating field-name mismatch. Claude Sonnet achieves 88% threat accuracy under the fuzzy protocol; errors concentrate in reconnaissance/brute-force/credential-stuffing classes. The authors propose SOC-Bench v0 with a 13-category taxonomy, statistical power requirements, fuzzy extraction specification, and public scoring script.

Significance. If the central measurement holds, the work identifies a previously under-appreciated methodological pitfall that can distort reported performance of LLM-based SOC classifiers by tens of percentage points. The built-in control (unchanged severity accuracy) and reproducible case study on fixed model outputs provide a clean isolation of the parser effect. The SOC-Bench proposal, if implemented with the promised public script, would be a concrete contribution toward more robust evaluation standards in the field.

major comments (3)
  1. [Case study and evaluation methodology] The fuzzy parser is load-bearing for the headline 76 pp gap, yet the manuscript supplies no explicit rules, mismatch examples, validation set, or code (see case-study description and evaluation protocol). Without this, it remains possible that the fuzzy logic selectively accepts outputs that a correctly specified strict parser would reject, undermining the claim that the gap is attributable entirely to suppression rather than correction bias.
  2. [Results and error analysis] The evaluation uses only 50 examples with no statistical test (e.g., McNemar or bootstrap CI) on the 0% vs 76% threat-accuracy difference. While the constant 58% severity accuracy provides a useful control, the small n limits the ability to rule out sampling variability or selective inflation on the remaining cases, weakening the causal attribution to parser mismatch alone.
  3. [Residual error analysis] The statement that 'each contributing all 4 misclassifications' for the three error-prone classes requires the total error count and per-class breakdown to be shown explicitly; with n=50 it is unclear whether this pattern reflects class-boundary difficulty or simply the distribution of the few errors that remain after fuzzy parsing.
minor comments (2)
  1. [Abstract] The abstract asserts the 76 pp gap is 'attributable entirely to evaluation methodology' without immediately qualifying the sample size or the reliance on the unvalidated fuzzy parser; a parenthetical note would improve precision.
  2. [Benchmark proposal] The SOC-Bench v0 proposal mentions a 'fuzzy field extraction specification' but does not indicate whether it will be released as executable code or only as prose rules, which affects reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important areas for improving transparency and statistical support. We address each major comment below and describe the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: The fuzzy parser is load-bearing for the headline 76 pp gap, yet the manuscript supplies no explicit rules, mismatch examples, validation set, or code (see case-study description and evaluation protocol). Without this, it remains possible that the fuzzy logic selectively accepts outputs that a correctly specified strict parser would reject, undermining the claim that the gap is attributable entirely to suppression rather than correction bias.

    Authors: We agree that the absence of explicit fuzzy parsing rules, mismatch examples, validation details, and code in the current manuscript limits full scrutiny of the parser. The constant 58% severity accuracy under both strict and fuzzy parsers provides evidence that the improvement is isolated to threat-field name mismatches rather than broad correction bias, as severity extraction rules were unchanged. Nevertheless, to enable independent verification, the revised manuscript will include the complete fuzzy matching rules, concrete examples of strict-regex failures versus fuzzy successes, the validation procedure used to develop the fuzzy parser, and the promised public scoring script. These additions will allow readers to assess whether the fuzzy logic introduces selective acceptance beyond field-name normalization. revision: yes

  2. Referee: The evaluation uses only 50 examples with no statistical test (e.g., McNemar or bootstrap CI) on the 0% vs 76% threat-accuracy difference. While the constant 58% severity accuracy provides a useful control, the small n limits the ability to rule out sampling variability or selective inflation on the remaining cases, weakening the causal attribution to parser mismatch alone.

    Authors: We acknowledge that the sample size of 50 examples is modest and that formal statistical tests were not reported. The 76-percentage-point difference is large enough that sampling variability is unlikely to account for it, and the unchanged severity accuracy serves as an internal control against general model or evaluation drift. In the revision we will add McNemar's test for the paired threat-accuracy difference and bootstrap confidence intervals around both accuracy figures. We will also include a brief power discussion noting the limitations of n=50 while emphasizing that the control variable and the magnitude of the observed gap support attribution to parser mismatch. revision: yes
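The promised statistics are cheap to compute from the paired per-example outcomes. The sketch below assumes the strict parser scored 0/50 and the fuzzy parser 38/50 on the same examples (the only per-example split consistent with the reported 0% and 76%); it runs an exact McNemar test on the discordant pairs and a percentile bootstrap around the 76% figure.

```python
import math
import random

# Assumed per-example outcomes: strict 0/50 correct, fuzzy 38/50 correct.
n = 50
strict_correct = [0] * n
fuzzy_correct = [1] * 38 + [0] * 12

# Exact McNemar test: only discordant pairs matter.
b = sum(1 for s, f in zip(strict_correct, fuzzy_correct) if f and not s)
c = sum(1 for s, f in zip(strict_correct, fuzzy_correct) if s and not f)
m = b + c
# Two-sided exact binomial p-value on the discordant counts.
p = min(1.0, 2 * sum(math.comb(m, k) for k in range(min(b, c) + 1)) / 2**m)

# Percentile bootstrap 95% CI for the fuzzy parser's 76% threat accuracy.
rng = random.Random(0)
boots = sorted(sum(rng.choices(fuzzy_correct, k=n)) / n for _ in range(10_000))
lo, hi = boots[249], boots[9749]  # 2.5th and 97.5th percentiles
```

With 38 discordant pairs all favoring the fuzzy parser, the exact p-value is on the order of 1e-11, supporting the authors' point that sampling variability cannot plausibly explain the gap, while the bootstrap interval makes the n=50 imprecision around 76% explicit.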

  3. Referee: The statement that 'each contributing all 4 misclassifications' for the three error-prone classes requires the total error count and per-class breakdown to be shown explicitly; with n=50 it is unclear whether this pattern reflects class-boundary difficulty or simply the distribution of the few errors that remain after fuzzy parsing.

    Authors: We agree that an explicit per-class error breakdown is required. Under the fuzzy parser, 12 errors occur on the 50-example set (76% accuracy). These 12 errors are distributed as 4 misclassifications in each of the three classes (reconnaissance, brute force, credential stuffing). The revised manuscript will include a table presenting the full per-class error counts, the specific misclassification pairs observed, and the total error count. This will make clear that the residual errors are concentrated among behaviorally adjacent classes rather than being uniformly distributed, supporting the interpretation of class-boundary difficulty. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central result is direct empirical measurement on fixed outputs

Full rationale

The paper's derivation consists of applying two independent parsers (strict regex and corrected fuzzy) to the identical set of model outputs from a fixed evaluation set of 50 examples, then reporting the resulting accuracy differences while using the constant severity accuracy (58%) as an internal control. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the chain; the 76 pp threat-accuracy gap is presented as a measured outcome rather than a constructed prediction. The SOC-Bench v0 proposal is a forward-looking specification and does not reduce to prior results by definition. This is a self-contained empirical demonstration whose conclusions are not assumed in its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The claim rests on the assumption that the fuzzy parser faithfully reflects model intent and that the 50-example set plus severity control isolates the parser effect; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: The model's output contains extractable threat and severity fields that a fuzzy parser can recover without systematic distortion.
    Invoked when claiming the 76% recovery is the true model performance rather than a parser artifact.
invented entities (1)
  • parsing-induced suppression (no independent evidence)
    purpose: To name the class of silent evaluation errors caused by strict regex mismatch on LLM outputs.
    New term introduced to describe the observed phenomenon; no independent evidence beyond the case study.

pith-pipeline@v0.9.0 · 5576 in / 1429 out tokens · 40083 ms · 2026-05-11T01:34:10.477754+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1] C. V. Garware and S. N. Zisad, "OpenSOC-AI: Democratizing Security Operations with Parameter Efficient LLM Log Analysis," arXiv preprint arXiv:2604.26217, 2026.
  2. [2] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, and W. Chen, "LoRA: Low-Rank Adaptation of Large Language Models," in Proc. ICLR, 2022.
  3. [3] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient Finetuning of Quantized LLMs," in Proc. NeurIPS, 2023.
  4. [4] P. Zhang, G. Zeng, T. Wang, and W. Lu, "TinyLlama: An Open-Source Small Language Model," arXiv:2401.02385, 2024.
  5. [5] M. A. Ferrag, M. Ndhlovu, N. Tihanyi, L. C. Magalhaes, M. Debbah, and T. Lestable, "Revolutionizing Cyber Threat Detection with Large Language Models," IEEE Access, 2023.
  6. [6] MITRE Corporation, "ATT&CK Framework v14," 2024. [Online]. Available: https://attack.mitre.org/
  7. [7] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On Calibration of Modern Neural Networks," in Proc. ICML, 2017.
  8. [8] E. B. Wilson, "Probable Inference, the Law of Succession, and Statistical Inference," Journal of the American Statistical Association, vol. 22, no. 158, pp. 209–212, 1927.
  9. [9] J. Dodge, S. Gururangan, D. Card, R. Schwartz, and N. A. Smith, "Show Your Work: Improved Reporting of Experimental Results," in Proc. EMNLP, 2019.