
arxiv: 2605.22027 · v1 · pith:UGDXGB5Pnew · submitted 2026-05-21 · 💻 cs.CR

Parser-Free Querying of Security Logs

Pith reviewed 2026-05-22 05:58 UTC · model grok-4.3

classification 💻 cs.CR
keywords security logs · natural language querying · LLM code generation · parser-free analysis · incident investigation · temporal log queries

The pith

Sieve generates executable query code for raw security logs from natural-language questions using one LLM call and lightweight format context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Security analysts must investigate threats by querying logs that arrive in many incompatible semi-structured formats. Building dedicated parsers for each format demands ongoing engineering work, while direct raw-file searches with basic tools such as grep cannot express the multi-line temporal correlations that real investigations require. Sieve instead feeds an LLM a short, automatically extracted description of the log layout plus the analyst's plain-English question and receives back ready-to-run query code. The paper evaluates this approach on 133 queries spanning five log types and reports more than a threefold reduction in error rate on complex temporal and cross-event queries relative to manual analyst scripting, with the largest gains on the multi-line correlation tasks that matter most during active incidents.

Core claim

By grounding a large language model with only lightweight, automatically extracted log-format context, Sieve produces executable query code from a natural-language security question in a single LLM call; the generated code is then executed deterministically on the raw logs. In an evaluation of 133 queries across five log types, the method cuts the error rate on complex temporal and cross-event queries by more than a factor of three compared with manual analyst scripting, and the improvement is greatest on the multi-line correlation queries that are central to ongoing investigations.

What carries the argument

Sieve, the pipeline that supplies an LLM with automatically extracted log-format context so that one call yields executable query code for raw files.
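
To make the shape of that pipeline concrete, here is a minimal sketch of the "extract format context, then build one prompt" step. It is not the paper's implementation: `extract_templates` is a crude, hypothetical stand-in for a real template miner (it only masks numeric tokens), `build_prompt` and its wording are assumptions, the sample log lines are invented for illustration, and the LLM call itself is left out.

```python
import re
from collections import Counter


def extract_templates(lines, max_templates=5):
    """Crude stand-in for template mining: mask numeric tokens, count line shapes."""
    shapes = Counter()
    for line in lines:
        # Replace integers and dotted numbers (PIDs, ports, IPv4 addresses) with a placeholder.
        shapes[re.sub(r"\b\d+(?:\.\d+)*\b", "<NUM>", line.strip())] += 1
    return [shape for shape, _ in shapes.most_common(max_templates)]


def build_prompt(question, templates):
    """Assemble a single prompt: lightweight format context plus the analyst's question."""
    context = "\n".join(f"- {t}" for t in templates)
    return (
        "Log line templates observed in the file:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Write a Python function query(lines) that answers the question."
    )


# Invented sample lines in a syslog-like layout (assumption, not benchmark data).
sample = [
    "Jan 10 03:12:01 web1 sshd[812]: Accepted password for alice from 10.0.0.5 port 50122 ssh2",
    "Jan 10 03:12:09 web1 sshd[815]: Accepted password for alice from 10.0.0.7 port 50340 ssh2",
    "Jan 10 03:13:44 web1 sshd[812]: Connection closed by 10.0.0.5 port 50122",
]
templates = extract_templates(sample)
prompt = build_prompt("Which users logged in from more than one IP?", templates)
print(prompt)
```

The two "Accepted password" lines collapse into one template, so the prompt carries two line shapes rather than three raw lines. The prompt would be sent to the model once, and the returned code would then run without further model involvement.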

If this is right

  • Analysts can write multi-line temporal and correlation queries directly against raw logs without first building or maintaining parsers.
  • Error rates on complex cross-event security tasks fall by more than a factor of three relative to hand-written scripts.
  • New log formats can be queried immediately once a short format description is extracted, removing the need for per-source engineering.
  • Security teams obtain structured-query expressiveness while retaining the immediacy of working with original raw files.
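
As an illustration of what a cross-event query run directly on raw lines might look like, the sketch below answers a question such as "for each host, count SSH session opens and closes, and compute unclosed sessions". The log lines are invented samples in a common syslog/pam_unix layout, and the regular expression and the function name `unclosed_sessions` are assumptions for this example, not output from Sieve.

```python
import re
from collections import defaultdict

# Matches syslog-style sshd session events; the layout is an assumption for this example.
SESSION_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+sshd\[\d+\]:\s+"
    r"pam_unix\(sshd:session\):\s+session\s+(?P<event>opened|closed)\b"
)


def unclosed_sessions(lines):
    """Count session opens and closes per host and report the difference."""
    counts = defaultdict(lambda: {"opened": 0, "closed": 0})
    for line in lines:
        m = SESSION_RE.match(line)
        if m:  # lines that are not session events are skipped
            counts[m["host"]][m["event"]] += 1
    return {
        host: {**c, "unclosed": c["opened"] - c["closed"]}
        for host, c in counts.items()
    }


logs = [
    "Jan 10 03:12:01 web1 sshd[812]: pam_unix(sshd:session): session opened for user alice by (uid=0)",
    "Jan 10 03:15:20 web1 sshd[901]: pam_unix(sshd:session): session opened for user bob by (uid=0)",
    "Jan 10 03:40:10 web1 sshd[812]: pam_unix(sshd:session): session closed for user alice",
    "Jan 10 04:01:00 db1 sshd[77]: pam_unix(sshd:session): session opened for user carol by (uid=0)",
    "Jan 10 04:05:00 db1 sshd[77]: pam_unix(sshd:session): session closed for user carol",
    "Jan 10 04:06:00 db1 sshd[80]: Accepted password for carol from 10.0.0.9 port 40000 ssh2",
]
result = unclosed_sessions(logs)
print(result)
```

No parser or schema is built ahead of time here; the format knowledge lives entirely in the one regular expression, which is the part a format-aware model would have to get right.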

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lightweight-context approach could be tested on other semi-structured data sources such as network packet captures or application traces.
  • Integration with existing log-collection pipelines could make format extraction fully automatic and invisible to the analyst.
  • The generated queries could be cached or translated into other query languages to support mixed-tool environments.

Load-bearing premise

The LLM will reliably output correct, executable query code when given only a single natural-language question and lightweight automatically extracted log-format context, without iterative refinement or human review in most cases.

What would settle it

A new log source or query where the code Sieve produces either fails to execute or returns results that differ from the correct answer obtained by an expert analyst.
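
A check of this kind can be sketched as a small harness that runs a piece of generated code and compares its output with an expert's answer. Everything here is hypothetical: the convention that generated code defines `query(lines)`, the function names, and the toy log lines. Running untrusted code through `exec` is unsafe outside a sandbox, so this is only a sketch of the comparison logic.

```python
def run_generated(code, log_lines):
    """Execute code that should define query(lines); return (ok, result_set)."""
    namespace = {}
    try:
        exec(code, namespace)  # unsafe for untrusted code; a real harness would sandbox this
        return True, set(namespace["query"](log_lines))
    except Exception:
        # Syntax errors, a missing query function, and runtime errors all count as failures.
        return False, set()


def check_against_expert(code, log_lines, expert_answer):
    """Return None if the code reproduces the expert answer, else a short reason."""
    ok, result = run_generated(code, log_lines)
    if not ok:
        return "fails to execute"
    if result != set(expert_answer):
        return "differs from expert answer"
    return None


logs = ["a FAIL", "b OK", "c FAIL"]
expert = {"a FAIL", "c FAIL"}

good = "def query(lines):\n    return [l for l in lines if 'FAIL' in l]\n"
broken = "def query(lines) return lines\n"  # syntax error on purpose
wrong = "def query(lines):\n    return [l for l in lines if 'OK' in l]\n"

verdicts = {
    "good": check_against_expert(good, logs, expert),
    "broken": check_against_expert(broken, logs, expert),
    "wrong": check_against_expert(wrong, logs, expert),
}
print(verdicts)
```

Either non-`None` verdict on a new log source or query would be the kind of counterexample described above.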

Figures

Figures reproduced from arXiv: 2605.22027 by David Wagner, Evan Luo, Julien Piet.

Figure 1: Sieve pipeline. A natural-language query and a log file …
Figure 2: Error rate (1 − 𝐹1) per dataset. Sieve+Drain3 reduces complex-tier error by 41–91% vs. the human baseline. Sieve approaches the accuracy of structured querying while outperforming manual scripting, combining the strengths of both: no preprocessing and low-effort queries.
Figure 3: Human baseline vs. Sieve+Drain3, 𝐹1 per dataset.
Figure 4: Prompt ablation: 𝐹1 when each component of the prompt (§4.1) is removed. Retries contribute most; templates help most on specialized formats.
Figure 5: Per-query 𝐹1 distributions across 𝑁=5 runs. Template-based strategies are tightly concentrated (median 𝜎 < 0.01); “none” shows the widest spread. The repeated-run analysis yields three conclusions. First, providing any form of context (templates or samples) produces a large, consistent improvement over no context: an 𝐹1 gain of about 0.39 on Audit S (Matryoshka vs. None), with the 95% confidence interval …
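
The captions above report error rate as 1 − F1 and summarize run-to-run spread with a median standard deviation. The sketch below shows one way those two quantities could be computed. It assumes F1 is taken over the set of returned items versus the expected set, and it uses the population standard deviation; the paper may define either differently. All numbers are toy values, not results from the paper.

```python
from statistics import median, pstdev


def f1(returned, expected):
    """Set-based F1 between a query's returned items and the expected items."""
    returned, expected = set(returned), set(expected)
    if not returned and not expected:
        return 1.0  # both empty: treat as a perfect match (assumption)
    tp = len(returned & expected)
    if tp == 0:
        return 0.0
    precision = tp / len(returned)
    recall = tp / len(expected)
    return 2 * precision * recall / (precision + recall)


def error_rate(returned, expected):
    """Error rate as used in the captions: 1 - F1."""
    return 1.0 - f1(returned, expected)


# Toy example: 2 of 3 returned items are correct, 2 of 4 expected items are found.
score = f1({1, 2, 3}, {2, 3, 4, 5})   # precision 2/3, recall 1/2, F1 = 4/7
err = error_rate({1, 2, 3}, {2, 3, 4, 5})

# Toy per-query F1 scores over N=5 repeated runs (invented values).
runs = {
    "q1": [0.9, 0.9, 0.9, 0.9, 0.9],
    "q2": [1.0, 0.8, 1.0, 0.8, 1.0],
    "q3": [0.5, 0.5, 0.5, 0.5, 0.6],
}
median_sigma = median([pstdev(scores) for scores in runs.values()])
print(round(score, 4), round(err, 4), round(median_sigma, 4))
```

Under this definition, a "threefold reduction in error rate" means the value of 1 − F1 for one method is at most one third of the other's, which is a statement about errors, not about F1 itself tripling.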
Original abstract

Security analysts routinely query system logs to detect threats and investigate incidents, but each log source uses its own semi-structured format: logs are cheap to produce, but expensive to use. The standard approach, building per-source parsers to normalize logs into structured schemas, is powerful but requires continuous engineering effort for each new format. Querying raw logs directly with tools like grep avoids this cost, but requires analysts to know each source's message variants and cannot express the multi-line temporal queries that security investigations demand. We present Sieve, a system that generates executable query code from natural-language security questions by grounding a large language model with lightweight, automatically extracted log-format context, requiring only one LLM call per query followed by deterministic execution. Evaluating 133 security queries across 5 log types, we find that Sieve achieves over a 3x reduction in error rate on complex temporal and cross-event queries compared to manual analyst scripting, with the largest gains on the multi-line correlation tasks most critical to active investigations. Our results and benchmark provide evidence that LLM-generated code can bridge the gap between the expressiveness of structured log querying and the immediacy of working directly with raw files.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Sieve, a system for parser-free querying of security logs. It uses a large language model to generate executable query code directly from natural-language security questions, grounded by lightweight automatically extracted log-format context in a single LLM call followed by deterministic execution. The central empirical claim is that on a benchmark of 133 security queries across five log types, Sieve achieves over a 3x reduction in error rate on complex temporal and cross-event queries relative to manual analyst scripting, with the largest gains on multi-line correlation tasks.

Significance. If the reported error-rate reduction holds under rigorous scrutiny, the work would offer a practical advance in security log analysis by eliminating the need for continuous per-source parser engineering while retaining the ability to express multi-line temporal and correlation queries. The provision of a 133-query benchmark across five log types is a concrete strength that enables future comparisons; the focus on real investigation tasks rather than synthetic queries adds relevance.

major comments (2)
  1. [Evaluation] Evaluation section: the abstract reports a >3x error-rate reduction on the 133-query benchmark, yet provides no description of how ground-truth oracles were constructed independently of the generated code, how partial matches or execution failures were scored, or whether multiple LLM samples were drawn per query. These details are load-bearing for interpreting the measured improvement on temporal and cross-event queries.
  2. [§3] §3 (system description): the claim that lightweight automatically extracted format context suffices for correct semantics on multi-line correlation tasks rests on an untested assumption that implicit ordering, field semantics, and variant message structures are adequately captured; the manuscript should include an ablation or error analysis showing where this context fails.
minor comments (2)
  1. [Abstract] The abstract states 'over a 3x reduction' without reporting the exact baseline and Sieve error rates or confidence intervals; adding these numbers would improve precision.
  2. [Evaluation] Clarify the exact definition of 'error rate' (e.g., query-level failure vs. result-set mismatch) and whether it was measured on held-out queries or the full set.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate additional methodological details and analysis as outlined.

  1. Referee: [Evaluation] Evaluation section: the abstract reports a >3x error-rate reduction on the 133-query benchmark, yet provides no description of how ground-truth oracles were constructed independently of the generated code, how partial matches or execution failures were scored, or whether multiple LLM samples were drawn per query. These details are load-bearing for interpreting the measured improvement on temporal and cross-event queries.

    Authors: We agree that these details are necessary for rigorous interpretation. Ground-truth oracles were constructed by security experts who independently defined expected results for each of the 133 queries on sampled log data, without reference to any LLM-generated code. Both execution failures and outputs that did not fully satisfy the query semantics (including partial matches) were scored as errors. A single LLM sample was used per query to ensure fair, deterministic comparison with the manual scripting baseline. We will add a new subsection in the Evaluation section that explicitly documents the oracle construction process, scoring rules, and sampling procedure, along with illustrative examples. revision: yes

  2. Referee: [§3] §3 (system description): the claim that lightweight automatically extracted format context suffices for correct semantics on multi-line correlation tasks rests on an untested assumption that implicit ordering, field semantics, and variant message structures are adequately captured; the manuscript should include an ablation or error analysis showing where this context fails.

    Authors: We acknowledge that an explicit ablation would provide stronger direct evidence. Our existing error analysis in Section 5 attributes the majority of failures on temporal and cross-event queries to LLM limitations in multi-step reasoning rather than format misinterpretation. To address the concern directly, we will add an ablation study in the revised manuscript that measures performance on the multi-line correlation subset both with and without the auto-extracted format context. This will quantify the context's contribution and identify any specific failure modes. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation of LLM-generated queries


The paper describes an empirical systems evaluation of Sieve, which generates query code via a single LLM call using lightweight auto-extracted log context. The headline result is a measured >3x error-rate reduction on complex temporal and cross-event queries, drawn from a set of 133 security queries across five log types and compared against manual analyst scripting. No equations, first-principles derivations, fitted parameters, or self-citation chains are invoked to produce this outcome; the improvement is reported directly from execution results on independently defined queries. The evaluation is externally falsifiable and does not reduce to any input by construction, satisfying the criteria for a self-contained, non-circular empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on the assumption that a single LLM call with lightweight format context is sufficient to produce reliable executable queries. No free parameters, axioms, or invented entities are visible in the abstract.

pith-pipeline@v0.9.0 · 5724 in / 1206 out tokens · 23625 ms · 2026-05-22T05:58:34.737406+00:00 · methodology


