Recognition: 2 theorem links
When the Ruler is Broken: Parsing-Induced Suppression in LLM-Based Security Log Evaluation
Pith reviewed 2026-05-11 01:34 UTC · model grok-4.3
The pith
Strict regex parsers can suppress reported LLM threat accuracy to 0% on outputs that a fuzzy parser scores at 76%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A strict regex parser applied to the free-form outputs of a LoRA-tuned TinyLlama model for security log threat classification produced 0% threat accuracy, whereas a corrected fuzzy parser on the same outputs and evaluation set recovered 76% threat accuracy; severity accuracy remained constant at 58% under both parsers. This isolates field-name format mismatch as the sole cause. Residual errors concentrated in reconnaissance, brute force, and credential stuffing. The work proposes SOC-Bench v0 with a standardized 13-category threat taxonomy, minimum statistical power requirements, fuzzy field extraction specification, and public scoring script.
What carries the argument
Parsing-induced suppression: the mechanism by which strict regex field extraction discards semantically correct LLM outputs due to minor format variations, understating true model performance.
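The mechanism can be sketched in a few lines. This is an illustrative reconstruction, not the paper's unpublished parsers: the strict pattern, the sample output, and the `fuzzy_extract` helper are all hypothetical.

```python
import re

# Hypothetical model output: semantically correct, but the field name
# differs from the exact form the strict parser expects.
output = "Threat Category: brute force\nSeverity: high"

# Strict parser: anchored regex on one exact field name. It finds nothing,
# so a correct answer is scored as a miss and accuracy is suppressed.
strict = re.search(r"^threat_category:\s*(.+)$", output, re.MULTILINE)
print(strict)  # None

# Fuzzy parser: normalize field names (case, spaces, separators) before
# matching, so "Threat Category" and "threat_category" unify.
def fuzzy_extract(text, field):
    want = re.sub(r"[\s_-]+", "", field).lower()
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        if re.sub(r"[\s_-]+", "", key).lower() == want:
            return value.strip()
    return None

print(fuzzy_extract(output, "threat_category"))  # brute force
```

The same model output thus scores 0% under the strict rule and 100% under the fuzzy one, which is the suppression effect in miniature.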
If this is right
- Threat accuracy can drop to zero solely from parser strictness while the underlying model remains functional.
- Severity accuracy provides a stable control that stays unchanged across parsers.
- Misclassifications cluster in behaviorally adjacent categories such as reconnaissance, brute force, and credential stuffing.
- Standardizing fuzzy extraction plus a fixed taxonomy removes parser-specific distortion from reported results.
Where Pith is reading between the lines
- Many existing LLM-based security evaluations may have understated model performance for the same parser-related reason.
- The proposed benchmark could enable fairer head-to-head comparisons of different models on security tasks.
- Models appear more robust at severity assessment than at fine-grained threat typing when output formats are allowed to vary.
Load-bearing premise
That the fuzzy parser correctly recovers the model's intended classifications without introducing its own systematic bias or over-accepting incorrect outputs.
What would settle it
Independent human re-labeling of the 50 model outputs to test whether the fuzzy parser's 76% threat accuracy matches human judgment on the actual classifications.
Figures
original abstract
LLM-based SOC log classifiers are commonly evaluated using regular-expression pipelines that extract structured fields from free-form model output. We demonstrate that this practice introduces a class of silent, systematic evaluation errors, which we term parsing-induced suppression, that can cause a fully functional model to appear completely non-functional. Using OpenSOC-AI, a LoRA fine-tuned TinyLlama-1.1B system for security log threat classification, as a reproducible case study, we show that a strict regex parser reported 0% threat accuracy while a corrected fuzzy parser recovered 76% threat accuracy on the same model outputs and the same evaluation set, a gap of 76 percentage points attributable entirely to evaluation methodology. Severity accuracy remained constant at 58% under both parsers, providing a built-in control that isolates field-name format mismatch, rather than model degradation, as the causal mechanism. For external reference, Claude Sonnet, evaluated zero-shot on the same 50-example set, achieved 88% threat accuracy and 58% severity accuracy under the same fuzzy protocol. Residual errors under fuzzy evaluation concentrate in three categories, reconnaissance, brute force, and credential stuffing, each contributing 4 misclassifications, a pattern that reflects class-boundary difficulty among behaviorally adjacent log types rather than global model failure. We propose SOC-Bench v0, a benchmark framework comprising a standardized 13-category threat taxonomy, minimum statistical power requirements, a fuzzy field extraction specification, and a public scoring script, intended to prevent parser-specific accuracy distortion in future SOC LLM research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that regex-based parsing pipelines for extracting structured fields from LLM outputs in security log classification introduce 'parsing-induced suppression,' a systematic error that can make functional models appear non-functional. Using OpenSOC-AI (a LoRA fine-tuned TinyLlama-1.1B) as a case study on a 50-example set, a strict regex parser yields 0% threat accuracy while a fuzzy parser recovers 76% on identical outputs, with severity accuracy constant at 58% as a control isolating field-name mismatch. Claude Sonnet achieves 88% threat accuracy under the fuzzy protocol; errors concentrate in reconnaissance/brute-force/credential-stuffing classes. The authors propose SOC-Bench v0 with a 13-category taxonomy, statistical power requirements, fuzzy extraction specification, and public scoring script.
Significance. If the central measurement holds, the work identifies a previously under-appreciated methodological pitfall that can distort reported performance of LLM-based SOC classifiers by tens of percentage points. The built-in control (unchanged severity accuracy) and reproducible case study on fixed model outputs provide a clean isolation of the parser effect. The SOC-Bench proposal, if implemented with the promised public script, would be a concrete contribution toward more robust evaluation standards in the field.
major comments (3)
- [Case study and evaluation methodology] The fuzzy parser is load-bearing for the headline 76 pp gap, yet the manuscript supplies no explicit rules, mismatch examples, validation set, or code (see case-study description and evaluation protocol). Without this, it remains possible that the fuzzy logic selectively accepts outputs that a correctly specified strict parser would reject, undermining the claim that the gap is attributable entirely to suppression rather than correction bias.
- [Results and error analysis] The evaluation uses only 50 examples with no statistical test (e.g., McNemar or bootstrap CI) on the 0% vs 76% threat-accuracy difference. While the constant 58% severity accuracy provides a useful control, the small n limits the ability to rule out sampling variability or selective inflation on the remaining cases, weakening the causal attribution to parser mismatch alone.
- [Residual error analysis] The statement that 'each contributing all 4 misclassifications' for the three error-prone classes requires the total error count and per-class breakdown to be shown explicitly; with n=50 it is unclear whether this pattern reflects class-boundary difficulty or simply the distribution of the few errors that remain after fuzzy parsing.
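The sample-size concern can be quantified with the Wilson score interval, which the paper's bibliography already cites (Wilson, 1927) for exactly this purpose. A minimal sketch, assuming the reported 76% corresponds to 38 of 50 correct:

```python
from math import sqrt

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_ci(38, 50)  # 76% threat accuracy on n=50
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```

On n=50 the 95% interval spans roughly 0.63 to 0.86, which supports the referee's point that a single 76% figure carries wide uncertainty even if the strict-vs-fuzzy gap itself is unambiguous.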
minor comments (2)
- [Abstract] The abstract asserts the 76 pp gap is 'attributable entirely to evaluation methodology' without immediately qualifying the sample size or the reliance on the unvalidated fuzzy parser; a parenthetical note would improve precision.
- [Benchmark proposal] The SOC-Bench v0 proposal mentions a 'fuzzy field extraction specification' but does not indicate whether it will be released as executable code or only as prose rules, which affects reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important areas for improving transparency and statistical support. We address each major comment below and describe the revisions we will make to the manuscript.
point-by-point responses
-
Referee: The fuzzy parser is load-bearing for the headline 76 pp gap, yet the manuscript supplies no explicit rules, mismatch examples, validation set, or code (see case-study description and evaluation protocol). Without this, it remains possible that the fuzzy logic selectively accepts outputs that a correctly specified strict parser would reject, undermining the claim that the gap is attributable entirely to suppression rather than correction bias.
Authors: We agree that the absence of explicit fuzzy parsing rules, mismatch examples, validation details, and code in the current manuscript limits full scrutiny of the parser. The constant 58% severity accuracy under both strict and fuzzy parsers provides evidence that the improvement is isolated to threat-field name mismatches rather than broad correction bias, as severity extraction rules were unchanged. Nevertheless, to enable independent verification, the revised manuscript will include the complete fuzzy matching rules, concrete examples of strict-regex failures versus fuzzy successes, the validation procedure used to develop the fuzzy parser, and the promised public scoring script. These additions will allow readers to assess whether the fuzzy logic introduces selective acceptance beyond field-name normalization. revision: yes
-
Referee: The evaluation uses only 50 examples with no statistical test (e.g., McNemar or bootstrap CI) on the 0% vs 76% threat-accuracy difference. While the constant 58% severity accuracy provides a useful control, the small n limits the ability to rule out sampling variability or selective inflation on the remaining cases, weakening the causal attribution to parser mismatch alone.
Authors: We acknowledge that the sample size of 50 examples is modest and that formal statistical tests were not reported. The 76-percentage-point difference is large enough that sampling variability is unlikely to account for it, and the unchanged severity accuracy serves as an internal control against general model or evaluation drift. In the revision we will add McNemar's test for the paired threat-accuracy difference and bootstrap confidence intervals around both accuracy figures. We will also include a brief power discussion noting the limitations of n=50 while emphasizing that the control variable and the magnitude of the observed gap support attribution to parser mismatch. revision: yes
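The proposed McNemar test can be previewed from the reported accuracies alone. A sketch under the assumption, implied by the 0% strict accuracy, that every fuzzy-correct item is a discordant pair:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact (binomial) McNemar test on discordant pair counts b and c."""
    n = b + c
    k = min(b, c)
    # Two-sided p-value: double the tail of Binomial(n, 0.5).
    p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5**n
    return min(p, 1.0)

# Strict parser: 0/50 correct; fuzzy parser: 38/50 correct on the same
# outputs, so all 38 fuzzy-correct items are discordant (assumed counts).
print(mcnemar_exact(b=0, c=38))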
-
Referee: The statement that 'each contributing all 4 misclassifications' for the three error-prone classes requires the total error count and per-class breakdown to be shown explicitly; with n=50 it is unclear whether this pattern reflects class-boundary difficulty or simply the distribution of the few errors that remain after fuzzy parsing.
Authors: We agree that an explicit per-class error breakdown is required. Under the fuzzy parser, 12 errors occur on the 50-example set (76% accuracy). These 12 errors are distributed as 4 misclassifications in each of the three classes (reconnaissance, brute force, credential stuffing). The revised manuscript will include a table presenting the full per-class error counts, the specific misclassification pairs observed, and the total error count. This will make clear that the residual errors are concentrated among behaviorally adjacent classes rather than being uniformly distributed, supporting the interpretation of class-boundary difficulty. revision: yes
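The promised per-class table can be previewed with a simple tally. The (gold, predicted) pairs below are fabricated to match only the reported totals (12 errors in 50 examples, 4 per class); the specific confusion pairs, such as "port scan", are hypothetical:

```python
from collections import Counter

# Fabricated (gold, predicted) pairs reproducing the reported error totals.
pairs = [("reconnaissance", "port scan")] * 4 \
      + [("brute force", "credential stuffing")] * 4 \
      + [("credential stuffing", "brute force")] * 4 \
      + [("malware", "malware")] * 38  # stand-in for the 38 correct cases

errors = Counter(gold for gold, pred in pairs if gold != pred)
accuracy = 1 - sum(errors.values()) / len(pairs)
print(errors)    # per-class error counts
print(accuracy)  # 0.76
```

A table built this way from the real pairs would also expose the specific misclassification directions, which is what the referee asks for.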
Circularity Check
No significant circularity; central result is direct empirical measurement on fixed outputs
full rationale
The paper's derivation consists of applying two independent parsers (strict regex and corrected fuzzy) to the identical set of model outputs from a fixed evaluation set of 50 examples, then reporting the resulting accuracy differences while using the constant severity accuracy (58%) as an internal control. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the chain; the 76 pp threat-accuracy gap is presented as a measured outcome rather than a constructed prediction. The SOC-Bench v0 proposal is a forward-looking specification and does not reduce to prior results by definition. The work is a self-contained empirical demonstration: its reported outputs are measurements, not restatements of its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The model's output contains extractable threat and severity fields that a fuzzy parser can recover without systematic distortion.
invented entities (1)
- parsing-induced suppression (no independent evidence)
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged: unclear)
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
"a strict regex parser reported 0% threat accuracy while a corrected fuzzy parser recovered 76% threat accuracy... Severity accuracy remained constant at 58% under both parsers"
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat.equivNat (tagged: unclear)
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
"SOC-Bench v0... fuzzy field extraction specification"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] C. V. Garware and S. N. Zisad, "OpenSOC-AI: Democratizing Security Operations with Parameter Efficient LLM Log Analysis," arXiv preprint arXiv:2604.26217, 2026.
- [2] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, and W. Chen, "LoRA: Low-Rank Adaptation of Large Language Models," in Proc. ICLR, 2022.
- [3] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient Finetuning of Quantized LLMs," in Proc. NeurIPS, 2023.
- [4] P. Zhang, G. Zeng, T. Wang, and W. Lu, "TinyLlama: An Open-Source Small Language Model," arXiv:2401.02385, 2024.
- [5] M. A. Ferrag, M. Ndhlovu, N. Tihanyi, L. C. Magalhaes, M. Debbah, and T. Lestable, "Revolutionizing Cyber Threat Detection with Large Language Models," IEEE Access, 2023.
- [6] MITRE Corporation, "ATT&CK Framework v14," 2024. [Online]. Available: https://attack.mitre.org/
- [7] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On Calibration of Modern Neural Networks," in Proc. ICML, 2017.
- [8] E. B. Wilson, "Probable Inference, the Law of Succession, and Statistical Inference," Journal of the American Statistical Association, vol. 22, no. 158, pp. 209–212, 1927.
- [9] J. Dodge, S. Gururangan, D. Card, R. Schwartz, and N. A. Smith, "Show Your Work: Improved Reporting of Experimental Results," in Proc. EMNLP, 2019.
discussion (0)