pith. machine review for the scientific record.

arxiv: 2604.20211 · v1 · submitted 2026-04-22 · 💻 cs.SE · cs.AI · cs.CR

Recognition: unknown

Towards Secure Logging: Characterizing and Benchmarking Logging Code Security Issues with LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:43 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CR
keywords logging code security · LLM detection · code repair · security taxonomy · benchmark dataset · issue patterns · prompt evaluation

The pith

LLMs detect logging security issues with moderate accuracy but struggle to generate reliable code repairs on a benchmark of 101 real cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Logging code records events for debugging and monitoring, yet insecure logging practices can leak sensitive data or enable injection attacks. The paper derives a taxonomy of four issue categories and ten patterns, then builds a benchmark from 101 manually reviewed real-world reports. It tests LLMs on detection and repair tasks using prompts that include issue descriptions, security pattern explanations, or both. Detection accuracy ranges from 12.9 percent to 52.5 percent on average, repairs remain unreliable, and issue descriptions alone improve detection more than pattern explanations or combined inputs. The work gives practitioners concrete patterns to check and clarifies the current limits of LLMs for this security task.
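To make the issue class concrete, the sketch below illustrates the log injection pattern shown in Figure 1 of the paper, alongside one common mitigation, in Python. It is a minimal illustration under assumed details; the paper's benchmark cases and exact fixes are not reproduced here.

    import logging

    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    log = logging.getLogger("auth")

    def login_insecure(username: str) -> None:
        # Vulnerable: attacker-controlled input is logged verbatim. An input
        # like "alice\nINFO admin login succeeded" forges a second log line.
        log.info("login failed for %s", username)

    def login_hardened(username: str) -> None:
        # Common mitigation: escape newline characters so a single event can
        # never span multiple log records.
        sanitized = username.replace("\r", "\\r").replace("\n", "\\n")
        log.info("login failed for %s", sanitized)

    login_insecure("alice\nINFO admin login succeeded")
    login_hardened("alice\nINFO admin login succeeded")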

Core claim

The authors derive a taxonomy of logging code security issues with four common categories and ten patterns. They assemble a benchmark of 101 real-world reports after manual review and annotation. An automated framework then evaluates LLMs on detection and repair, revealing moderate detection performance with accuracies from 12.9 percent to 52.5 percent on average while exposing clear difficulties in producing correct repairs. Issue descriptions alone raise detection accuracy more effectively than security pattern explanations or their combination.

What carries the argument

The taxonomy of four categories and ten logging security patterns, paired with the manually annotated benchmark of 101 real-world reports, inside a prompt-based evaluation framework that supplies different contextual knowledge to LLMs.
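The paper's framework is not reproduced on this page, so the following Python sketch shows only one plausible shape for such a harness: it toggles the two kinds of contextual knowledge and scores detection against the benchmark labels. The prompt wording, record fields, and call_llm stub are assumptions, not the authors' implementation.

    # Hypothetical harness; field names and prompt text are illustrative.
    def build_prompt(code: str, description: str | None, patterns: str | None) -> str:
        parts = ["You are reviewing logging code for security issues."]
        if description:
            parts.append("Issue report description:\n" + description)
        if patterns:
            parts.append("Known logging security patterns:\n" + patterns)
        parts.append("Code:\n" + code +
                     "\nDoes this logging code contain a security issue? "
                     "Answer with the pattern name or 'none'.")
        return "\n\n".join(parts)

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug a model client in here")

    def detection_accuracy(benchmark: list[dict], use_desc: bool, use_pat: bool) -> float:
        # benchmark items: {"code", "description", "patterns", "label"}
        hits = 0
        for case in benchmark:
            prompt = build_prompt(
                case["code"],
                case["description"] if use_desc else None,
                case["patterns"] if use_pat else None,
            )
            hits += call_llm(prompt).strip().lower() == case["label"].lower()
        return hits / len(benchmark)

Running detection_accuracy over the four settings (no context, description only, patterns only, both) mirrors the comparison the paper draws between contextual inputs.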

If this is right

  • The taxonomy supplies developers with specific patterns to audit logging code for leaks or injection risks.
  • Issue descriptions alone can be prioritized in prompts to raise LLM detection rates without adding pattern explanations.
  • LLMs can serve as initial detectors for logging issues but require human review to ensure repair quality.
  • The benchmark enables repeated testing of future models on the same set of real cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same taxonomy-plus-benchmark method could be applied to security issues in other code areas such as authentication or data serialization.
  • Training data that emphasizes issue descriptions over abstract patterns may improve LLM performance on security-related code tasks.
  • The performance gap between detection and repair suggests that separate models or stages could be used for each step in secure-logging tools.

Load-bearing premise

The 101 manually reviewed and annotated real-world logging security issue reports form a representative set that captures the full range of practical problems.

What would settle it

A larger or independently collected set of logging issues where LLMs produce correct repairs for most cases, or where detection accuracy falls outside the reported 12.9 to 52.5 percent range.

Figures

Figures reproduced from arXiv: 2604.20211 by An Ran Chen, He Yang Yuan, Kundi Yao, Xin Wang, Zhenhao Li, Zishuo Ding.

Figure 1. A log injection vulnerability. Malicious input containing encoded newline characters is logged without …
Figure 2. An overview of our study.
Figure 3. Case study with different Patch Similarity.
Figure 4. Comparison of average detection accuracy and patch similarity across four settings' token usage.
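Figures 3 and 4 score repairs by patch similarity, which this page does not define. A minimal stand-in, assuming a normalized token-level sequence match between the generated repair and the developer's reference patch (the paper's exact metric may differ):

    import difflib

    def patch_similarity(generated: str, reference: str) -> float:
        # Ratio of matching token subsequences, in [0, 1]; 1.0 means the
        # generated patch matches the reference patch token for token.
        return difflib.SequenceMatcher(
            None, generated.split(), reference.split()
        ).ratio()

    ref = 'log.info("user %s logged in", sanitize(user))'
    gen = 'log.info("user %s logged in", user)'
    print(f"similarity: {patch_similarity(gen, ref):.2f}")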
Original abstract

Logging code plays an important role in software systems by recording key events and behaviors, which are essential for debugging and monitoring. However, insecure logging practices can inadvertently expose sensitive information or enable attacks such as log injection, posing serious threats to system security and privacy. Prior research has examined general defects in logging code, but systematic analysis of logging code security issues remains limited, particularly in leveraging LLMs for detection and repair. In this paper, we derive a comprehensive taxonomy of logging code security issues, encompassing four common issue categories and 10 corresponding patterns. We further construct a benchmark dataset with 101 real-world logging security issue reports that have been manually reviewed and annotated. We then propose an automated framework that incorporates various contextual knowledge to evaluate LLMs' capabilities in detecting and repairing logging security issues. Our experimental results reveal a notable disparity in performance: while LLMs are moderately effective at detecting security issues (e.g., the accuracy ranges from 12.9% to 52.5% on average), they face noticeable challenges in reliably generating correct code repairs. We also find that the issue description alone improves the LLMs' detection accuracy more than the security pattern explanation or a combination of both. Overall, our findings provide actionable insights for practitioners and highlight the potential and limitations of current LLMs for secure logging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper derives a taxonomy of logging code security issues with four categories and ten patterns, constructs a benchmark of 101 real-world reports that were manually reviewed and annotated, and evaluates LLMs on detection and repair tasks using different combinations of issue descriptions and security pattern explanations as context. It reports moderate detection performance (accuracies from 12.9% to 52.5%) but poor repair success, and finds that supplying only the issue description outperforms using security patterns or both.

Significance. If the ground-truth labels are reliable, the work provides a useful characterization of an under-studied class of security issues in logging code and the first benchmark for assessing LLM assistance in this area. The comparative evaluation of contextual prompts yields actionable insights for practitioners. The construction of an external, real-world benchmark is a positive aspect that could support follow-on research if the dataset and annotation details are made available.

major comments (2)
  1. Benchmark dataset section: The paper states that the 101 reports 'have been manually reviewed and annotated' against the 4-category/10-pattern taxonomy, but supplies no details on annotator count, inter-annotator agreement (e.g., Cohen's kappa), disagreement resolution, selection criteria for the reports, or external validation. All reported detection accuracies, repair results, and the claim that 'issue description alone improves the LLMs' detection accuracy more than the security pattern explanation' rest on the correctness of these labels; the absence of this information makes the central empirical claims unverifiable.
  2. Experimental evaluation section: The manuscript reports accuracy ranges and comparative prompt results but does not specify the exact LLM versions, full prompting templates, number of runs to account for stochasticity, or any statistical tests for differences between conditions. These omissions directly affect reproducibility and the strength of the comparative finding.
minor comments (2)
  1. [Abstract and Results] The abstract and results text should clarify what 'on average' refers to when stating the 12.9%–52.5% accuracy range (across models, categories, or runs).
  2. [Taxonomy section] The derivation process for the taxonomy (how the four categories and ten patterns were obtained from prior logging-defect literature) could be described more explicitly, with additional citations to related work on logging defects.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The points raised are valid and will help improve the clarity, verifiability, and reproducibility of our work. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: Benchmark dataset section: The paper states that the 101 reports 'have been manually reviewed and annotated' against the 4-category/10-pattern taxonomy, but supplies no details on annotator count, inter-annotator agreement (e.g., Cohen's kappa), disagreement resolution, selection criteria for the reports, or external validation. All reported detection accuracies, repair results, and the claim that 'issue description alone improves the LLMs' detection accuracy more than the security pattern explanation' rest on the correctness of these labels; the absence of this information makes the central empirical claims unverifiable.

    Authors: We agree that the absence of these details limits the verifiability of the ground-truth labels. In the revised manuscript, we will expand the Benchmark dataset section with the requested information: annotator count (two authors annotated independently), inter-annotator agreement (Cohen's kappa will be computed and reported), disagreement resolution (via discussion to reach consensus), selection criteria (reports drawn from GitHub issues and public vulnerability databases using targeted keywords for logging security issues), and external validation steps. These additions will directly support the reliability of our detection and repair results without altering the original findings. revision: yes

  2. Referee: Experimental evaluation section: The manuscript reports accuracy ranges and comparative prompt results but does not specify the exact LLM versions, full prompting templates, number of runs to account for stochasticity, or any statistical tests for differences between conditions. These omissions directly affect reproducibility and the strength of the comparative finding.

    Authors: We concur that these omissions hinder reproducibility and weaken the comparative claims. In the revision, we will update the Experimental evaluation section to specify exact LLM versions (e.g., GPT-4-turbo, Claude-3), include full prompting templates (in an appendix), report the number of runs (three runs per condition with averaged results to address stochasticity), and add statistical tests (e.g., McNemar's test or Wilcoxon signed-rank test) for differences across prompt conditions. This will make the evaluation fully reproducible and provide stronger support for the finding that issue descriptions outperform other contexts. revision: yes
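The agreement and significance tests promised above are standard; a minimal Python sketch with scikit-learn and statsmodels, using hypothetical labels and counts (the contingency table is chosen to sum to the 101 benchmark cases, not taken from the paper):

    from sklearn.metrics import cohen_kappa_score
    from statsmodels.stats.contingency_tables import mcnemar

    # Hypothetical annotations: two annotators assign each report a pattern.
    annotator_a = ["injection", "leak", "leak", "tamper", "injection"]
    annotator_b = ["injection", "leak", "tamper", "tamper", "injection"]
    print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")

    # McNemar's test compares two prompt conditions on the same cases.
    # Cell counts: rows = condition A correct/wrong, cols = condition B.
    table = [[40, 18],   # A correct: B correct, B wrong
             [6, 37]]    # A wrong:   B correct, B wrong
    result = mcnemar(table, exact=True)
    print(f"McNemar p-value: {result.pvalue:.4f}")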

Circularity Check

0 steps flagged

Empirical benchmark study with no self-referential derivations or fitted predictions

full rationale

The paper derives a taxonomy from manual review of real-world reports, constructs an external benchmark of 101 annotated logging security issues, and then runs standard LLM detection/repair experiments against that benchmark. No equations, parameters, or predictions appear; the reported accuracies (12.9–52.5%) and comparative prompt findings are direct empirical measurements on independently labeled external data rather than quantities forced by construction from the inputs. No self-citation chains, ansatzes, or renamings of known results are load-bearing for the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the representativeness of the manually curated benchmark and the assumption that the four-category taxonomy comprehensively covers common logging security issues.

axioms (1)
  • domain assumption: Manually reviewed and annotated real-world reports accurately and representatively capture logging security issues
    The benchmark dataset is constructed from 101 such reports; the performance numbers depend on this labeling being correct and complete.

pith-pipeline@v0.9.0 · 5549 in / 1335 out tokens · 33351 ms · 2026-05-10T00:43:34.885054+00:00 · methodology

