Recognition: unknown
Towards Secure Logging: Characterizing and Benchmarking Logging Code Security Issues with LLMs
Pith reviewed 2026-05-10 00:43 UTC · model grok-4.3
The pith
On a benchmark of 101 real-world cases, LLMs detect logging security issues with moderate accuracy but struggle to generate reliable code repairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors derive a taxonomy of logging code security issues with four common categories and ten patterns. They assemble a benchmark of 101 real-world reports after manual review and annotation. An automated framework then evaluates LLMs on detection and repair, revealing moderate detection performance, with average accuracies ranging from 12.9 percent to 52.5 percent, alongside clear difficulties in producing correct repairs. Issue descriptions alone raise detection accuracy more effectively than security pattern explanations or their combination.
What carries the argument
The taxonomy of four categories and ten logging security patterns, paired with the manually annotated benchmark of 101 real-world reports, inside a prompt-based evaluation framework that supplies different contextual knowledge to LLMs.
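The paper's exact prompt templates are not reproduced here. As a rough illustration of how such a framework might assemble the three contextual conditions compared in the study (issue description only, pattern explanation only, or both), consider this hypothetical Python sketch; the template wording, parameter names, and function are assumptions, not the authors' actual templates:

```python
# A guessed sketch of prompt assembly for the detection task; everything
# below is illustrative, not taken from the paper's replication package.
def build_detection_prompt(code: str,
                           issue_description: str | None = None,
                           pattern_explanation: str | None = None) -> str:
    parts = [
        "You are a security reviewer. Decide whether the logging code "
        "below contains a security issue, and if so name the pattern."
    ]
    if issue_description:  # condition 1: issue description only
        parts.append("Issue description:\n" + issue_description)
    if pattern_explanation:  # condition 2: pattern explanation only
        parts.append("Security pattern explanations:\n" + pattern_explanation)
    # Passing both arguments yields condition 3; passing neither gives a
    # bare baseline prompt with no contextual knowledge.
    parts.append("Logging code:\n" + code)
    return "\n\n".join(parts)
```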
If this is right
- The taxonomy supplies developers with specific patterns to audit logging code for leaks or injection risks (see the sketch after this list).
- Issue descriptions alone can be prioritized in prompts to raise LLM detection rates without adding pattern explanations.
- LLMs can serve as initial detectors for logging issues but require human review to ensure repair quality.
- The benchmark enables repeated testing of future models on the same set of real cases.
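To make the first bullet concrete, here is a minimal Python sketch of two patterns the taxonomy covers, sensitive information exposure and log injection (CWE-117), with repaired counterparts. The function names and redaction strategy are illustrative assumptions, not examples drawn from the benchmark:

```python
import logging

logger = logging.getLogger(__name__)

# Insecure: the raw credential lands in the log (sensitive data exposure).
def login_insecure(username: str, password: str) -> None:
    logger.info("login attempt: user=%s password=%s", username, password)

# Insecure: unsanitized input with embedded newlines can forge extra
# log lines, a classic log injection vector (CWE-117).
def record_insecure(user_input: str) -> None:
    logger.info("received: " + user_input)

# Repaired: redact the secret instead of logging it.
def login_secure(username: str) -> None:
    logger.info("login attempt: user=%s password=[REDACTED]", username)

# Repaired: neutralize control characters before logging.
def record_secure(user_input: str) -> None:
    sanitized = user_input.replace("\r", "\\r").replace("\n", "\\n")
    logger.info("received: %s", sanitized)
```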
Where Pith is reading between the lines
- The same taxonomy-plus-benchmark method could be applied to security issues in other code areas such as authentication or data serialization.
- Training data that emphasizes issue descriptions over abstract patterns may improve LLM performance on security-related code tasks.
- The performance gap between detection and repair suggests that separate models or stages could be used for each step in secure-logging tools (a pipeline sketch follows this list).
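As a sketch of that last point, a secure-logging tool could route detection and repair through separate stages and only act on a repair when the detector's verdict justifies it. The stage interfaces and placeholder logic below are hypothetical, not the paper's framework:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    pattern: str      # one of the taxonomy's ten pattern ids
    confidence: float

def detect(code: str) -> Finding | None:
    # Placeholder detector: a real tool would call a detection-tuned LLM.
    if "password" in code:
        return Finding(pattern="sensitive-data-exposure", confidence=0.9)
    return None

def repair(code: str, finding: Finding) -> str:
    # Placeholder repairer: a real tool would call a repair-tuned LLM.
    return code.replace("password=%s", "password=[REDACTED]")

def secure_logging_pipeline(code: str, threshold: float = 0.7) -> str:
    finding = detect(code)
    if finding is None or finding.confidence < threshold:
        return code  # nothing detected, or too uncertain to act on
    candidate = repair(code, finding)
    # Given the weak repair results the review reports, candidates should
    # go to human review rather than being applied automatically.
    return candidate
```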
Load-bearing premise
The 101 manually reviewed and annotated real-world logging security issue reports form a representative set that captures the full range of practical problems.
What would settle it
A larger or independently collected set of logging issues where LLMs produce correct repairs for most cases, or where detection accuracy falls outside the reported 12.9 to 52.5 percent range.
Original abstract
Logging code plays an important role in software systems by recording key events and behaviors, which are essential for debugging and monitoring. However, insecure logging practices can inadvertently expose sensitive information or enable attacks such as log injection, posing serious threats to system security and privacy. Prior research has examined general defects in logging code, but systematic analysis of logging code security issues remains limited, particularly in leveraging LLMs for detection and repair. In this paper, we derive a comprehensive taxonomy of logging code security issues, encompassing four common issue categories and 10 corresponding patterns. We further construct a benchmark dataset with 101 real-world logging security issue reports that have been manually reviewed and annotated. We then propose an automated framework that incorporates various contextual knowledge to evaluate LLMs' capabilities in detecting and repairing logging security issues. Our experimental results reveal a notable disparity in performance: while LLMs are moderately effective at detecting security issues (e.g., the accuracy ranges from 12.9% to 52.5% on average), they face noticeable challenges in reliably generating correct code repairs. We also find that the issue description alone improves the LLMs' detection accuracy more than the security pattern explanation or a combination of both. Overall, our findings provide actionable insights for practitioners and highlight the potential and limitations of current LLMs for secure logging.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper derives a taxonomy of logging code security issues with four categories and ten patterns, constructs a benchmark of 101 real-world reports that were manually reviewed and annotated, and evaluates LLMs on detection and repair tasks using different combinations of issue descriptions and security pattern explanations as context. It reports moderate detection performance (accuracies from 12.9% to 52.5%) but poor repair success, and finds that supplying only the issue description outperforms using security patterns or both.
Significance. If the ground-truth labels are reliable, the work provides a useful characterization of an under-studied class of security issues in logging code and the first benchmark for assessing LLM assistance in this area. The comparative evaluation of contextual prompts yields actionable insights for practitioners. The construction of an external, real-world benchmark is a positive aspect that could support follow-on research if the dataset and annotation details are made available.
major comments (2)
- [Benchmark dataset section] The paper states that the 101 reports 'have been manually reviewed and annotated' against the 4-category/10-pattern taxonomy, but supplies no details on annotator count, inter-annotator agreement (e.g., Cohen's kappa), disagreement resolution, selection criteria for the reports, or external validation. All reported detection accuracies, repair results, and the claim that 'issue description alone improves the LLMs' detection accuracy more than the security pattern explanation' rest on the correctness of these labels; the absence of this information makes the central empirical claims unverifiable.
- [Experimental evaluation section] The manuscript reports accuracy ranges and comparative prompt results but does not specify the exact LLM versions, full prompting templates, number of runs to account for stochasticity, or any statistical tests for differences between conditions. These omissions directly affect reproducibility and the strength of the comparative finding.
minor comments (2)
- [Abstract and Results] The abstract and results text should clarify what 'on average' refers to when stating the 12.9%–52.5% accuracy range (across models, categories, or runs).
- [Taxonomy section] The derivation process for the taxonomy (how the four categories and ten patterns were obtained from prior logging-defect literature) could be described more explicitly, with additional citations to related work on logging defects.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The points raised are valid and will help improve the clarity, verifiability, and reproducibility of our work. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Benchmark dataset section] The paper states that the 101 reports 'have been manually reviewed and annotated' against the 4-category/10-pattern taxonomy, but supplies no details on annotator count, inter-annotator agreement (e.g., Cohen's kappa), disagreement resolution, selection criteria for the reports, or external validation. All reported detection accuracies, repair results, and the claim that 'issue description alone improves the LLMs' detection accuracy more than the security pattern explanation' rest on the correctness of these labels; the absence of this information makes the central empirical claims unverifiable.
  Authors: We agree that the absence of these details limits the verifiability of the ground-truth labels. In the revised manuscript, we will expand the Benchmark dataset section with the requested information: annotator count (two authors annotated independently), inter-annotator agreement (Cohen's kappa will be computed and reported), disagreement resolution (via discussion to reach consensus), selection criteria (reports drawn from GitHub issues and public vulnerability databases using targeted keywords for logging security issues), and external validation steps. These additions will directly support the reliability of our detection and repair results without altering the original findings. Revision: yes.
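For reference, the promised agreement statistic is straightforward to compute. Here is a minimal sketch using scikit-learn with invented toy labels; the real computation would cover all 101 reports and the ten pattern ids:

```python
from sklearn.metrics import cohen_kappa_score

# Toy labels from two annotators; each label is one of the taxonomy's
# pattern ids. These values are made up for illustration only.
annotator_a = ["P1", "P3", "P3", "P7", "P2", "P3", "P9", "P1"]
annotator_b = ["P1", "P3", "P4", "P7", "P2", "P3", "P9", "P2"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values above ~0.6 are commonly read as substantial
```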
- Referee: [Experimental evaluation section] The manuscript reports accuracy ranges and comparative prompt results but does not specify the exact LLM versions, full prompting templates, number of runs to account for stochasticity, or any statistical tests for differences between conditions. These omissions directly affect reproducibility and the strength of the comparative finding.
  Authors: We concur that these omissions hinder reproducibility and weaken the comparative claims. In the revision, we will update the Experimental evaluation section to specify exact LLM versions (e.g., GPT-4-turbo, Claude-3), include full prompting templates (in an appendix), report the number of runs (three runs per condition with averaged results to address stochasticity), and add statistical tests (e.g., McNemar's test or Wilcoxon signed-rank test) for differences across prompt conditions. This will make the evaluation fully reproducible and provide stronger support for the finding that issue descriptions outperform other contexts. Revision: yes.
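Because every prompt condition is evaluated on the same 101 benchmark items, the comparisons are paired, which is what makes McNemar's test the natural choice here. A minimal sketch using statsmodels with invented per-item correctness vectors (real vectors would have 101 entries):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Invented per-item correctness (1 = detected correctly) for two prompt
# conditions applied to the same items; illustration only.
desc_only = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
both_ctx = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0])

# Build the 2x2 paired contingency table:
# rows = desc_only correct/wrong, columns = both_ctx correct/wrong.
table = np.zeros((2, 2), dtype=int)
for a, b in zip(desc_only, both_ctx):
    table[1 - a, 1 - b] += 1

result = mcnemar(table, exact=True)  # exact binomial version for small counts
print(f"statistic={result.statistic:.0f}, p-value={result.pvalue:.3f}")
```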
Circularity Check
Empirical benchmark study with no self-referential derivations or fitted predictions
Full rationale
The paper derives a taxonomy from manual review of real-world reports, constructs an external benchmark of 101 annotated logging security issues, and then runs standard LLM detection/repair experiments against that benchmark. No equations, parameters, or predictions appear; the reported accuracies (12.9%–52.5%) and comparative prompt findings are direct empirical measurements on independently labeled external data rather than quantities forced by construction from the inputs. No self-citation chains, ansatzes, or renamings of known results are load-bearing for the central claims.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: manually reviewed and annotated real-world reports accurately and representatively capture logging security issues.