arxiv: 2604.20553 · v1 · submitted 2026-04-22 · 💻 cs.SE

DeepParse: Hybrid Log Parsing with LLM-Synthesized Regex Masks

Pith reviewed 2026-05-09 23:52 UTC · model grok-4.3

classification 💻 cs.SE
keywords log parsing · log templates · variable extraction · hybrid parsing · LLM regex synthesis · Drain algorithm · anomaly detection · structured logs

The pith

DeepParse has an LLM synthesize reusable regex masks from small log samples, then applies them deterministically inside the Drain algorithm, reaching 97.6% average Parsing Accuracy across 16 benchmarks without per-line LLM inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern distributed systems produce vast numbers of free-form log messages that must be converted into structured templates for reliability monitoring, security analysis, and anomaly detection. Heuristic parsers such as Drain are fast but lose accuracy on complex variables, while full LLM parsing generalizes better yet becomes prohibitively expensive when run on every line. DeepParse addresses the tradeoff by first asking an LLM to extract variable patterns from only a few examples and then embedding those patterns as fixed regex masks that the Drain algorithm matches deterministically. Across sixteen benchmark datasets this separation of one-time reasoning from repeated execution delivers higher accuracy in variable extraction and greater consistency than either baseline. When the resulting structured logs feed an anomaly detection pipeline, false alarms drop by more than thirty percent and inference latency falls by thirty-six percent.

Core claim

DeepParse automatically mines reusable variable patterns from small log samples using an LLM and then applies them deterministically through the Drain algorithm. By separating the reasoning phase from execution, the method enables accurate, scalable, and cost-efficient log structuring without relying on brittle handcrafted rules or per-line neural inference.

What carries the argument

LLM-synthesized regex masks that are inserted into the Drain log-parsing algorithm to replace its heuristic variable detection step.
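
A minimal Python sketch of the mask-first idea, under stated assumptions: the four patterns below are hypothetical examples of what an LLM might synthesize (they are not taken from the paper), and a plain dictionary grouping stands in for Drain's fixed-depth tree. The names `MASKS`, `apply_masks`, and `group_by_template` are illustrative only.

```python
import re
from collections import defaultdict

# Hypothetical masks, not the paper's. Order matters: specific patterns
# run before the generic integer mask so they are not broken apart.
MASKS = [
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"),  # IPv4 address, optional port
    re.compile(r"\bblk_-?\d+\b"),                          # HDFS-style block id
    re.compile(r"\b0x[0-9a-fA-F]+\b"),                     # hex literal
    re.compile(r"\b\d+\b"),                                # bare integer
]


def apply_masks(line, masks=MASKS, placeholder="<*>"):
    """Replace every match of every mask with a fixed placeholder."""
    for pattern in masks:
        line = pattern.sub(placeholder, line)
    return line


def group_by_template(lines):
    """Stand-in for Drain: group lines whose masked form is identical."""
    groups = defaultdict(list)
    for line in lines:
        groups[apply_masks(line)].append(line)
    return dict(groups)


logs = [
    "Received block blk_-1608999687919862906 of size 91178 from 10.250.19.102:50010",
    "Received block blk_3587508140051953248 of size 67108864 from 10.251.42.84:50010",
]
templates = group_by_template(logs)
```

Because the masks are ordinary compiled regexes, the same input always yields the same template, which is the determinism the pith attributes to the execution phase.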

If this is right

  • Average Parsing Accuracy, the paper's measure of variable extraction, reaches 97.6 percent across sixteen standard log-parsing benchmarks.
  • Consistency of the extracted templates exceeds that of both heuristic and LLM-only baselines.
  • False-alarm rate in a downstream anomaly detector drops by more than thirty percent.
  • Inference latency in the anomaly detection pipeline falls by thirty-six percent relative to heuristic baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Occasional LLM synthesis steps could replace continuous model inference for any text domain whose underlying patterns change slowly.
  • The same one-time-pattern-then-deterministic-match design might improve other high-volume text-processing pipelines that currently pay for model calls on every record.
  • Engineers would still need a practical rule for deciding when accumulated log drift requires the LLM synthesis step to be rerun.
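
One possible shape for such a rerun rule, sketched in Python as an assumption rather than anything the paper proposes: track how many recent lines produce a template that was not known when the masks were synthesized, and flag a rerun once that share passes a threshold. The class name `DriftMonitor`, the window size, and the threshold are all hypothetical.

```python
from collections import deque


class DriftMonitor:
    """Flags when too many recent lines yield templates unseen at synthesis time."""

    def __init__(self, known_templates, window=1000, threshold=0.05):
        self.known = set(known_templates)
        self.recent = deque(maxlen=window)  # True marks an unseen template
        self.threshold = threshold

    def unseen_rate(self):
        """Share of lines in the current window with an unknown template."""
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def should_resynthesize(self):
        # Decide only once the window is full, so the first few lines
        # cannot trigger a rerun on their own.
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and self.unseen_rate() > self.threshold

    def observe(self, template):
        """Record one parsed template and report whether a rerun is due."""
        self.recent.append(template not in self.known)
        return self.should_resynthesize()
```

The threshold trades LLM cost against staleness: a lower value reruns synthesis sooner, a higher one tolerates more unparsed variety before paying for another call.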

Load-bearing premise

The patterns the LLM extracts from small samples remain general enough to handle new and evolving log formats without needing fresh LLM calls or manual rule updates for each change.

What would settle it

Running DeepParse on a fresh collection of logs whose formats differ from the original samples and finding that its variable-extraction accuracy falls below that of either a full LLM parser or carefully hand-tuned rules.

Figures

Figures reproduced from arXiv: 2604.20553 by Amir Shetaia, Sean Kauffman.

Figure 1
Figure 1 illustrates the DeepParse workflow. An LLM synthesizes regex masks from a small sample of logs offline; these masks then guide a deterministic Drain parser online. Diagram labels: System Logs, Logs Sampling, Sampled Logs, Fine-Tuned DeepSeek-R1:8B, Generated Regex Patterns, Drain Algorithm, Log Templates.
Figure 2
Figure 2. Accuracy vs. number of shots (Hadoop). Saturation occurs at approximately 50 labeled examples. The parse regex list is synthesized once from these 50 logs and then reused for all logs per system. In practice, when ground truth is unavailable, practitioners can start with the default of 50 and monitor parsing output stability: if increasing the sample size does not change the synthesized mask list, saturation…
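
The stability check this caption describes could be sketched as follows. This is a hedged illustration: `find_saturation` and `toy_synthesize` are hypothetical names, the toy synthesizer is a deterministic stand-in for the LLM call, and taking prefixes of the log list (rather than random samples) is an assumption made to keep the example short.

```python
import re


def find_saturation(logs, synthesize, sizes=(10, 25, 50, 100)):
    """Return the smallest tried size after which the mask set stops changing."""
    prev_size, prev_masks = None, None
    for size in sizes:
        masks = frozenset(synthesize(logs[:size]))
        if masks == prev_masks:
            return prev_size
        prev_size, prev_masks = size, masks
    return None  # no two consecutive sizes agreed


# Toy stand-in for the LLM (hypothetical): it emits a mask only once
# the sample contains at least one line that needs it.
CANDIDATES = [r"\d+\.\d+\.\d+\.\d+", r"0x[0-9a-f]+"]


def toy_synthesize(sample):
    return [p for p in CANDIDATES if any(re.search(p, s) for s in sample)]


logs = ["conn from 10.0.0.1"] * 3 + ["ptr 0xdeadbeef"] * 2 + ["conn from 10.0.0.2"] * 5
saturation = find_saturation(logs, toy_synthesize, sizes=(2, 4, 8, 10))
```

Masks are compared as sets so that a reordered but otherwise identical list still counts as stable. With a real LLM, whose output can vary between calls, an exact-match comparison may be too strict.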
Figure 4
Figure 4. Per-batch inference latency for LogBERT preprocessing under two pipelines: the original Drain-based parser and DeepParse’s mask-first parser. Each bar shows the mean latency (ms) over the evaluation run for Authentication and Configuration logs; the improvement comes from reducing the effective vocabulary by masking high-cardinality identifiers, which lowers downstream embedding and sequence proce…
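
To make the vocabulary argument in the Figure 4 caption concrete, here is a small Python sketch that counts distinct whitespace tokens before and after masking. The log lines and the single `VALUE_MASK` pattern are invented for illustration and do not come from the paper's datasets.

```python
import re

# Hypothetical mask: replace whatever follows "=" in key=value pairs.
VALUE_MASK = re.compile(r"(?<==)\S+")


def vocabulary(lines):
    """Set of distinct whitespace-separated tokens across all lines."""
    return {token for line in lines for token in line.split()}


raw = [
    "login ok user=alice session=9f3a1c",
    "login ok user=bob session=77b2e0",
    "login failed user=carol session=c01d44",
]
masked = [VALUE_MASK.sub("<*>", line) for line in raw]

raw_vocab_size = len(vocabulary(raw))        # 9 distinct tokens
masked_vocab_size = len(vocabulary(masked))  # 5 distinct tokens
```

Every new user or session id adds a token to the raw vocabulary, while the masked vocabulary stays fixed. That is the mechanism the caption credits for the lower downstream cost.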
Original abstract

Modern distributed systems produce massive, heterogeneous logs essential for reliability, security, and anomaly detection. Converting these free-form messages into structured templates (log parsing) is challenging due to evolving formats and limited labeled data. Machine-learning-based parsers like Drain are fast but accuracy often degrades on complex variables, while Large Language Models (LLMs) offer better generalization but incur prohibitive inference costs. This paper presents DeepParse, a hybrid framework that automatically mines reusable variable patterns from small log samples using an LLM, then applies them deterministically through the Drain algorithm. By separating the reasoning phase from execution, DeepParse enables accurate, scalable, and cost-efficient log structuring without relying on brittle handcrafted rules or per-line neural inference. Across 16 benchmark datasets, DeepParse achieves higher accuracy in variable extraction (97.6% average Parsing Accuracy) and better consistency than both heuristic and LLM-only baselines. Integrating DeepParse into an anomaly detection pipeline reduced false alarms by over 30% and reduced inference latency by 36% compared to heuristic baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes DeepParse, a hybrid log parsing framework that uses an LLM to automatically synthesize reusable regex masks (variable patterns) from small log samples, which are then applied deterministically within the Drain algorithm. This separates the reasoning phase (LLM) from execution (heuristic), aiming to achieve high accuracy and consistency on heterogeneous logs without handcrafted rules or per-line LLM inference. On 16 benchmark datasets, it reports 97.6% average parsing accuracy in variable extraction, outperforming both heuristic and LLM-only baselines in consistency; integration into an anomaly detection pipeline yields >30% false-alarm reduction and 36% lower inference latency versus heuristic baselines.

Significance. If the empirical claims are substantiated, the hybrid design offers a practical advance for log parsing in distributed systems by mitigating the accuracy limitations of pure heuristics like Drain on complex variables while avoiding the high inference costs of standalone LLMs. The downstream gains in anomaly detection and latency reduction highlight potential impact on reliability and security pipelines handling evolving log formats.

major comments (1)
  1. [Evaluation] Evaluation section: The central claims rest on 97.6% average Parsing Accuracy, superior consistency, >30% false-alarm reduction, and 36% latency improvement across 16 benchmarks, yet the manuscript supplies no details on experimental setup, data splits, pattern validation procedures, controls for selection bias, or full per-dataset results. This absence prevents assessment of reproducibility and validity of the reported gains.
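
For readers who want to see what per-dataset reporting would involve, here is a hedged Python sketch of one common message-level definition of parsing accuracy: the share of lines whose predicted template exactly matches the ground truth. The text reviewed here does not say which definition the paper uses, and some benchmarks score by grouping instead, so treat `parsing_accuracy` and `macro_average` as illustrative.

```python
def parsing_accuracy(predicted, truth):
    """Fraction of lines whose predicted template equals the ground truth."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        return 0.0
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def macro_average(per_dataset):
    """Unweighted mean of per-dataset scores."""
    return sum(per_dataset.values()) / len(per_dataset)


predicted = ["open <*>", "close <*>", "open <*>", "read <*> bytes"]
truth = ["open <*>", "close <*>", "open <*>", "read <*> bytes from <*>"]
score = parsing_accuracy(predicted, truth)  # 3 of 4 lines match
```

An unweighted average over datasets can hide one weak dataset behind several strong ones, which is one reason the referee's request for full per-dataset results matters.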

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recommending major revision. We agree that the Evaluation section requires substantially more detail to support reproducibility and to allow proper assessment of the reported gains.

  1. Referee: [Evaluation] Evaluation section: The central claims rest on 97.6% average Parsing Accuracy, superior consistency, >30% false-alarm reduction, and 36% latency improvement across 16 benchmarks, yet the manuscript supplies no details on experimental setup, data splits, pattern validation procedures, controls for selection bias, or full per-dataset results. This absence prevents assessment of reproducibility and validity of the reported gains.

    Authors: We agree that the current manuscript lacks the necessary experimental details. In the revised version we will expand the Evaluation section with: (1) a complete description of the experimental setup, including the LLM used for mask synthesis, hardware, and software versions; (2) explicit references and characteristics of all 16 benchmark datasets; (3) the sample-selection procedure for regex mining together with controls for selection bias (e.g., random sampling across multiple runs); (4) the pattern-validation protocol, including how masks were tested for reusability on held-out lines; and (5) full per-dataset tables reporting Parsing Accuracy, consistency scores, anomaly-detection false-alarm rates, and latency for DeepParse and all baselines. These additions will enable independent reproduction of the 97.6% average accuracy, the >30% false-alarm reduction, and the 36% latency improvement.

    revision: yes
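
Items (3) and (4) of the response can be illustrated with a short Python sketch of seeded sampling with a held-out remainder. This is an assumption about how such a control might be implemented, not the authors' protocol; `split_sample` is a hypothetical helper.

```python
import random


def split_sample(logs, k, seed):
    """Draw k lines for mask synthesis; return (sample, held_out) in original order."""
    rng = random.Random(seed)  # private generator, so runs are reproducible
    indices = list(range(len(logs)))
    rng.shuffle(indices)
    chosen = set(indices[:k])
    sample = [logs[i] for i in sorted(chosen)]
    held_out = [logs[i] for i in range(len(logs)) if i not in chosen]
    return sample, held_out


logs = [f"line {i}" for i in range(20)]
# One split per seed; masks would be synthesized on each sample and
# scored on the matching held-out lines.
runs = {seed: split_sample(logs, k=5, seed=seed) for seed in (0, 1, 2)}
```

Reporting the spread of scores across seeds, rather than a single run, is one way to show that the result does not depend on a lucky sample.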

Circularity Check

0 steps flagged

No significant circularity; empirical method with benchmark validation

The paper describes a hybrid log-parsing pipeline (LLM pattern synthesis from small samples followed by deterministic Drain application) and reports empirical results on 16 datasets (97.6% average parsing accuracy, latency and false-alarm reductions). No equations, derivations, or self-citations are present in the provided text that reduce any claimed result to its own inputs by construction. The central claims rest on external benchmark comparisons rather than fitted parameters renamed as predictions or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM pattern synthesis from small samples produces reusable masks that generalize across heterogeneous logs; this is a domain assumption without independent verification supplied in the abstract.

axioms (1)
  • domain assumption: an LLM can synthesize generalizable regex patterns from small log samples that remain effective on unseen logs
    Core premise enabling separation of reasoning and execution phases; invoked to justify cost savings and accuracy gains.
