Parser-Free Querying of Security Logs
Pith reviewed 2026-05-22 05:58 UTC · model grok-4.3
The pith
Sieve generates executable query code for raw security logs from natural-language questions using one LLM call and lightweight format context.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By grounding a large language model with only lightweight, automatically extracted log-format context, Sieve produces executable query code from a natural-language security question in a single LLM call; the generated code is then executed deterministically on the raw logs. On a benchmark of 133 queries spanning five log types, the method cuts the error rate on complex temporal and cross-event queries by more than a factor of three relative to manual analyst scripting, and the gains are largest on the multi-line correlation tasks that matter most in active investigations.
What carries the argument
Sieve, the pipeline that supplies an LLM with automatically extracted log-format context so that one call yields executable query code for raw files.
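A minimal sketch of how such a pipeline could be wired together, assuming Python as the target language for generated code. Every name here (`extract_format_context`, `build_prompt`, `fake_llm`, `run_generated`, and the `query(lines)` convention) is hypothetical and not taken from the paper, and the LLM call is replaced by a stub that returns fixed code so the example runs offline.

```python
import re
from collections import Counter

def extract_format_context(lines, max_templates=5):
    """Mask variable tokens (IPs, numbers) and return the most common
    resulting line shapes. This is an assumed, simplified stand-in for
    whatever format extraction the real system performs."""
    def mask(line):
        line = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<IP>", line)
        return re.sub(r"\d+", "<NUM>", line)
    counts = Counter(mask(line) for line in lines)
    return [template for template, _ in counts.most_common(max_templates)]

def build_prompt(question, templates):
    """Combine the analyst's question with the extracted line shapes."""
    shapes = "\n".join(f"- {t}" for t in templates)
    return (
        "Observed log line shapes:\n" + shapes + "\n\n"
        "Write a Python function query(lines) that answers:\n" + question
    )

def fake_llm(prompt):
    """Stub for the single LLM call; returns canned code (assumption)."""
    return (
        "def query(lines):\n"
        "    return sum('Failed password' in line for line in lines)\n"
    )

def run_generated(code, lines):
    """Execute the generated code deterministically on the raw lines.
    A real deployment would need sandboxing; exec is used only for brevity."""
    namespace = {}
    exec(code, namespace)
    return namespace["query"](lines)

# Hypothetical sshd-style lines, invented for this example.
LOG = [
    "Jan 10 03:12:01 host1 sshd[411]: Failed password for root from 10.0.0.5 port 2201",
    "Jan 10 03:12:04 host1 sshd[411]: Failed password for root from 10.0.0.5 port 2202",
    "Jan 10 03:12:09 host1 sshd[412]: Accepted password for alice from 10.0.0.7 port 2300",
]

templates = extract_format_context(LOG)
prompt = build_prompt("How many failed password attempts are there?", templates)
result = run_generated(fake_llm(prompt), LOG)
print(result)
```

The point of the split is that the model is consulted once, on a short prompt, while the scan over the raw file is done by ordinary code whose behavior can be inspected and rerun.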
If this is right
- Analysts can write multi-line temporal and correlation queries directly against raw logs without first building or maintaining parsers.
- Error rates on complex cross-event security tasks fall by more than a factor of three relative to hand-written scripts.
- New log formats can be queried immediately once a short format description is extracted, removing the need for per-source engineering.
- Security teams obtain structured-query expressiveness while retaining the immediacy of working with original raw files.
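To make the first bullet concrete, here is a hand-written sketch of the kind of multi-line temporal correlation that a single grep cannot express: flag source addresses with at least three failed logins followed by a success within 60 seconds. The log lines, the regular expression, the assumed year, and the thresholds are illustrative assumptions, not material from the paper.

```python
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Assumed sshd-style line shape; real sources have more variants.
LINE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ sshd\[\d+\]: "
    r"(?P<outcome>Failed|Accepted) password for \S+ from (?P<ip>\S+)"
)

def brute_force_then_success(lines, min_failures=3, window_s=60, year=2025):
    """Return (ip, timestamp) pairs where a successful login follows at
    least min_failures failures from the same IP within window_s seconds."""
    window = timedelta(seconds=window_s)
    recent_failures = defaultdict(deque)  # ip -> timestamps of recent failures
    hits = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed shape
        # Syslog timestamps omit the year, so one is supplied by the caller.
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        fails = recent_failures[m["ip"]]
        # Drop failures that have aged out of the window.
        while fails and ts - fails[0] > window:
            fails.popleft()
        if m["outcome"] == "Failed":
            fails.append(ts)
        elif len(fails) >= min_failures:
            hits.append((m["ip"], ts.isoformat()))
            fails.clear()
    return hits

LOG = [
    "Jan 10 03:12:01 host1 sshd[411]: Failed password for root from 10.0.0.5 port 2201",
    "Jan 10 03:12:04 host1 sshd[411]: Failed password for root from 10.0.0.5 port 2202",
    "Jan 10 03:12:09 host1 sshd[411]: Failed password for root from 10.0.0.5 port 2203",
    "Jan 10 03:12:30 host1 sshd[411]: Accepted password for root from 10.0.0.5 port 2204",
    "Jan 10 03:15:00 host1 sshd[412]: Accepted password for alice from 10.0.0.7 port 2300",
]

hits = brute_force_then_success(LOG)
print(hits)
```

The query needs per-source state carried across lines and a time comparison between events, which is why it is usually written as a script rather than a pattern match.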
Where Pith is reading between the lines
- The same lightweight-context approach could be tested on other semi-structured data sources such as network packet captures or application traces.
- Integration with existing log-collection pipelines could make format extraction fully automatic and invisible to the analyst.
- The generated queries could be cached or translated into other query languages to support mixed-tool environments.
Load-bearing premise
The LLM will reliably output correct, executable query code when given only a single natural-language question and lightweight automatically extracted log-format context, without iterative refinement or human review in most cases.
What would settle it
A new log source or query where the code Sieve produces either fails to execute or returns results that differ from the correct answer obtained by an expert analyst.
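The check described above can be phrased as a small harness: run the generated code, and count the query as an error if it raises or if its result differs from the expert's answer. The function name `score_query` and the `query(lines)` convention are hypothetical, and the three code strings stand in for LLM output.

```python
def score_query(generated_code, lines, expected):
    """Classify one query as 'ok', 'execution_failure', or 'mismatch'."""
    namespace = {}
    try:
        exec(generated_code, namespace)       # may raise SyntaxError
        actual = namespace["query"](lines)    # may raise at run time
    except Exception:
        return "execution_failure"
    return "ok" if actual == expected else "mismatch"

# Invented log lines; the expert answer to "how many sessions opened?" is 2.
LINES = [
    "session opened for user alice",
    "session closed for user alice",
    "session opened for user bob",
]

GOOD = "def query(lines):\n    return sum('opened' in l for l in lines)\n"
WRONG = "def query(lines):\n    return len(lines)\n"
BROKEN = "def query(lines):\n    return lines[0].missing_method()\n"

outcomes = [score_query(code, LINES, expected=2) for code in (GOOD, WRONG, BROKEN)]
error_rate = sum(o != "ok" for o in outcomes) / len(outcomes)
print(outcomes, error_rate)
```

Treating both failure kinds as errors gives a single query-level rate; keeping the two labels separate makes it possible to report them individually as well.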
Original abstract
Security analysts routinely query system logs to detect threats and investigate incidents, but each log source uses its own semi-structured format: logs are cheap to produce, but expensive to use. The standard approach, building per-source parsers to normalize logs into structured schemas, is powerful but requires continuous engineering effort for each new format. Querying raw logs directly with tools like grep avoids this cost, but requires analysts to know each source's message variants and cannot express the multi-line temporal queries that security investigations demand. We present Sieve, a system that generates executable query code from natural-language security questions by grounding a large language model with lightweight, automatically extracted log-format context, requiring only one LLM call per query followed by deterministic execution. Evaluating 133 security queries across 5 log types, we find that Sieve achieves over a 3x reduction in error rate on complex temporal and cross-event queries compared to manual analyst scripting, with the largest gains on the multi-line correlation tasks most critical to active investigations. Our results and benchmark provide evidence that LLM-generated code can bridge the gap between the expressiveness of structured log querying and the immediacy of working directly with raw files.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Sieve, a system for parser-free querying of security logs. It uses a large language model to generate executable query code directly from natural-language security questions, grounded by lightweight automatically extracted log-format context in a single LLM call followed by deterministic execution. The central empirical claim is that on a benchmark of 133 security queries across five log types, Sieve achieves over a 3x reduction in error rate relative to manual analyst scripting, with the largest gains on complex temporal and cross-event queries.
Significance. If the reported error-rate reduction holds under rigorous scrutiny, the work would offer a practical advance in security log analysis by eliminating the need for continuous per-source parser engineering while retaining the ability to express multi-line temporal and correlation queries. The provision of a 133-query benchmark across five log types is a concrete strength that enables future comparisons; the focus on real investigation tasks rather than synthetic queries adds relevance.
major comments (2)
- [Evaluation] Evaluation section: the abstract reports a >3x error-rate reduction on the 133-query benchmark, yet provides no description of how ground-truth oracles were constructed independently of the generated code, how partial matches or execution failures were scored, or whether multiple LLM samples were drawn per query. These details are load-bearing for interpreting the measured improvement on temporal and cross-event queries.
- [§3] §3 (system description): the claim that lightweight automatically extracted format context suffices for correct semantics on multi-line correlation tasks rests on an untested assumption that implicit ordering, field semantics, and variant message structures are adequately captured; the manuscript should include an ablation or error analysis showing where this context fails.
minor comments (2)
- [Abstract] The abstract states 'over a 3x reduction' without reporting the exact baseline and Sieve error rates or confidence intervals; adding these numbers would improve precision.
- [Evaluation] Clarify the exact definition of 'error rate' (e.g., query-level failure vs. result-set mismatch) and whether it was measured on held-out queries or the full set.
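As an illustration of the reporting the minor comments ask for, the following sketch computes query-level error rates, a Wilson 95% score interval for each, and the reduction factor. Only the benchmark size of 133 comes from the abstract; the error counts are placeholders invented for the example, not results from the paper, and the Wilson interval is one common choice rather than the authors' stated method.

```python
from math import sqrt

def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a binomial proportion errors / n."""
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

N_QUERIES = 133          # benchmark size stated in the abstract
BASELINE_ERRORS = 40     # placeholder count, not from the paper
SIEVE_ERRORS = 12        # placeholder count, not from the paper

baseline_rate = BASELINE_ERRORS / N_QUERIES
sieve_rate = SIEVE_ERRORS / N_QUERIES
reduction = baseline_rate / sieve_rate

for name, errors in (("baseline", BASELINE_ERRORS), ("sieve", SIEVE_ERRORS)):
    lo, hi = wilson_interval(errors, N_QUERIES)
    print(f"{name}: rate={errors / N_QUERIES:.3f} 95% CI=({lo:.3f}, {hi:.3f})")
print(f"reduction factor: {reduction:.2f}x")
```

Reporting both rates with intervals lets a reader judge whether a "more than 3x" ratio is stable or driven by a small number of failures.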
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate additional methodological details and analysis as outlined.
- Referee: [Evaluation] Evaluation section: the abstract reports a >3x error-rate reduction on the 133-query benchmark, yet provides no description of how ground-truth oracles were constructed independently of the generated code, how partial matches or execution failures were scored, or whether multiple LLM samples were drawn per query. These details are load-bearing for interpreting the measured improvement on temporal and cross-event queries.
  Authors: We agree that these details are necessary for rigorous interpretation. Ground-truth oracles were constructed by security experts who independently defined expected results for each of the 133 queries on sampled log data, without reference to any LLM-generated code. Both execution failures and outputs that did not fully satisfy the query semantics (including partial matches) were scored as errors. A single LLM sample was used per query to ensure fair, deterministic comparison with the manual scripting baseline. We will add a new subsection in the Evaluation section that explicitly documents the oracle construction process, scoring rules, and sampling procedure, along with illustrative examples.
  Revision: yes
- Referee: [§3] §3 (system description): the claim that lightweight automatically extracted format context suffices for correct semantics on multi-line correlation tasks rests on an untested assumption that implicit ordering, field semantics, and variant message structures are adequately captured; the manuscript should include an ablation or error analysis showing where this context fails.
  Authors: We acknowledge that an explicit ablation would provide stronger direct evidence. Our existing error analysis in Section 5 attributes the majority of failures on temporal and cross-event queries to LLM limitations in multi-step reasoning rather than format misinterpretation. To address the concern directly, we will add an ablation study in the revised manuscript that measures performance on the multi-line correlation subset both with and without the auto-extracted format context. This will quantify the context's contribution and identify any specific failure modes.
  Revision: yes
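The proposed ablation amounts to scoring the same query subset under two prompt conditions and comparing the outcomes. A minimal sketch, assuming per-query pass/fail results have already been recorded; the query identifiers and outcomes below are fabricated placeholders for illustration only.

```python
def error_rate(outcomes):
    """Fraction of queries whose outcome is a failure (False)."""
    return sum(not ok for ok in outcomes.values()) / len(outcomes)

def fixed_by_context(with_ctx, without_ctx):
    """Queries that fail without format context but pass with it."""
    return sorted(q for q in with_ctx if with_ctx[q] and not without_ctx[q])

def failing_in_both(with_ctx, without_ctx):
    """Queries that fail either way, so context is not the limiting factor."""
    return sorted(q for q in with_ctx if not with_ctx[q] and not without_ctx[q])

# Placeholder outcomes: True means the generated code matched the oracle.
WITH_CONTEXT = {"q1": True, "q2": True, "q3": False, "q4": True}
WITHOUT_CONTEXT = {"q1": True, "q2": False, "q3": False, "q4": False}

print("error rate with context:   ", error_rate(WITH_CONTEXT))
print("error rate without context:", error_rate(WITHOUT_CONTEXT))
print("fixed by context:", fixed_by_context(WITH_CONTEXT, WITHOUT_CONTEXT))
print("failing in both: ", failing_in_both(WITH_CONTEXT, WITHOUT_CONTEXT))
```

Queries that fail in both conditions are the ones an error analysis would need to examine for reasoning failures rather than format misinterpretation.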
Circularity Check
No circularity: empirical evaluation of LLM-generated queries
The paper describes an empirical systems evaluation of Sieve, which generates query code via a single LLM call using lightweight auto-extracted log context. The headline result is a measured >3x error-rate reduction on a benchmark of 133 security queries across five log types, compared against manual analyst scripting. No equations, first-principles derivations, fitted parameters, or self-citation chains are invoked to produce this outcome; the improvement is reported directly from execution results on independently defined queries. The evaluation is externally falsifiable and does not reduce to any input by construction, satisfying the criteria for a self-contained, non-circular empirical claim.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Sieve uses the model as a compiler: we generate code from the analyst’s prompt, and the code performs the scan, join, and aggregation at normal program speed.
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: the prompt gives the LLM the query together with examples or templates of the SSH message formats
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.