AnomalyGen: Enhancing Log-Based Anomaly Detection with Code-Guided Data Augmentation
Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3
The pith
AnomalyGen generates labeled log sequences from source code to augment training data for better anomaly detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that training data for log-based anomaly detection can be effectively augmented by synthesizing sequences directly from the program's source code. The method builds log-oriented control flow graphs to enumerate possible execution paths, uses chain-of-thought reasoning in large language models to check logical consistency and generate runtime parameters, and applies domain heuristics to assign labels. As a result, anomaly detection models can better handle valid but previously unseen execution paths that would otherwise be flagged as false positives.
What carries the argument
AnomalyGen, a three-stage framework that builds Log-Oriented Control Flow Graphs (LCFGs) from source code, applies LLM Chain-of-Thought reasoning for verification and parameter generation, and labels sequences using domain heuristics.
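The path-enumeration stage can be pictured with a small sketch. The graph, node names, and log templates below are hypothetical and are not taken from the paper; the code only illustrates the general idea of walking a control flow graph whose nodes may carry a log template and collecting the template sequence along each acyclic path.

```python
from typing import Dict, List, Optional

# Hypothetical toy graph: node -> successors. Not taken from the paper.
EDGES: Dict[str, List[str]] = {
    "entry": ["check"],
    "check": ["write", "fail"],
    "write": ["exit"],
    "fail": ["exit"],
    "exit": [],
}

# Hypothetical log templates attached to nodes; None means no logging statement.
TEMPLATES: Dict[str, Optional[str]] = {
    "entry": "Receiving block <*>",
    "check": None,
    "write": "Received block <*> of size <*>",
    "fail": "Exception writing block <*>",
    "exit": None,
}


def enumerate_log_paths(start: str, end: str) -> List[List[str]]:
    """Return the log-template sequence of every acyclic path from start to end."""
    results: List[List[str]] = []

    def dfs(node: str, visited: frozenset, logs: List[str]) -> None:
        template = TEMPLATES.get(node)
        if template is not None:
            logs = logs + [template]  # copy, so sibling branches stay independent
        if node == end:
            results.append(logs)
            return
        for nxt in EDGES.get(node, []):
            if nxt not in visited:  # skip already-visited nodes so loops terminate
                dfs(nxt, visited | {nxt}, logs)

    dfs(start, frozenset({start}), [])
    return results


paths = enumerate_log_paths("entry", "exit")
for path in paths:
    print(path)
```

Restricting the walk to acyclic paths is a simplification chosen for this sketch; how the actual framework bounds loops and recursion is not described in this summary.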
If this is right
- Augmented datasets lead to improved performance for a wide range of anomaly detection models on real-world systems.
- Both the static analysis component for path enumeration and the LLM-based verification step are essential to the gains.
- The approach works for both supervised and unsupervised detection techniques.
- Public release of the framework and datasets enables further research on code-guided augmentation.
Where Pith is reading between the lines
- If the method scales to larger codebases, it could significantly reduce the data collection burden for training reliable anomaly detectors in industry.
- Similar code-to-data synthesis might be applied to other software analysis tasks like test generation or performance modeling.
- Future work could test if the generated sequences remain effective when the source code evolves over time with new features.
Load-bearing premise
The generated log sequences must accurately represent real runtime behaviors and have correct anomaly labels so that they enhance rather than confuse the training of detection models.
What would settle it
Train the same models on the original data and on the augmented data, then evaluate both on a separate test set of real logs: no improvement, or a decrease in detection accuracy, would count against the claim.
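A minimal sketch of that comparison, assuming binary labels (1 = anomalous sequence) and two sets of predictions on the same held-out real logs. The labels and predictions are invented placeholders, not results from the paper; they only show how false alarms on valid sequences pull F1 down.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive (anomalous) class, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Placeholder test set: two anomalous sequences, four normal ones.
y_true = [1, 1, 0, 0, 0, 0]
pred_original = [1, 1, 1, 1, 0, 0]   # two false alarms on normal sequences
pred_augmented = [1, 1, 1, 0, 0, 0]  # one false alarm

f1_original = f1_score(y_true, pred_original)
f1_augmented = f1_score(y_true, pred_augmented)
print(f"original: {f1_original:.3f}  augmented: {f1_augmented:.3f}")
```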
Original abstract
Log-based anomaly detection is fundamentally constrained by training data sparsity. Our empirical study reveals that public benchmark datasets cover less than 10% of source code log templates. Consequently, models frequently misclassify unseen but valid execution paths as anomalies, leading to false alarms. To address this, we propose AnomalyGen, a novel framework that augments training data by synthesizing labeled log sequences from source code. AnomalyGen combines log-oriented static analysis with Large Language Model (LLM) reasoning in three stages: (1) building Log-Oriented Control Flow Graphs (LCFGs) to enumerate structurally valid execution paths; (2) applying LLM Chain-of-Thought (CoT) reasoning to verify logical consistency and generate realistic runtime parameters (e.g., block IDs, IP addresses); and (3) labeling generated sequences with domain heuristics. Evaluations on HDFS and Zookeeper across 12 diverse anomaly detection models show AnomalyGen consistently improves performance. Deep learning models achieved average F1-score gains of 2.18% (HDFS) and 1.69% (Zookeeper), with an unsupervised Transformer on HDFS jumping from 0.818 to 0.970. Ablation results show that both static analysis and LLM-based verification are necessary: removing them reduces F1 by up to 8.7 and 10.7 percentage points, respectively. Our framework and datasets are publicly available to facilitate future research.
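The coverage figure in the abstract (benchmarks covering less than 10% of source-code log templates) is a set ratio. A minimal sketch, assuming templates have already been extracted from both the source code and the dataset and normalised to comparable strings; the template IDs below are invented placeholders, not data from the paper:

```python
from typing import Iterable


def template_coverage(source_templates: Iterable[str],
                      observed_templates: Iterable[str]) -> float:
    """Fraction of source-code log templates that also occur in the dataset."""
    source = set(source_templates)
    if not source:
        raise ValueError("no source templates given")
    return len(source & set(observed_templates)) / len(source)


# Placeholder IDs: 20 templates in the code, 1 of them seen in the dataset,
# plus one dataset template ("X99") that does not map to the code at all.
source = [f"T{i:02d}" for i in range(20)]
observed = ["T03", "T03", "X99"]
coverage = template_coverage(source, observed)
print(f"coverage: {coverage:.0%}")
```

Dataset templates that match nothing in the code are ignored here; whether and how the authors handle such cases is not stated in this summary.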
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AnomalyGen, a framework that augments sparse training data for log-based anomaly detection by synthesizing labeled log sequences directly from source code. It constructs Log-Oriented Control Flow Graphs (LCFGs) to enumerate structurally valid execution paths, employs LLM Chain-of-Thought reasoning to generate realistic runtime parameters and verify consistency, and applies domain heuristics for labeling. Evaluations across HDFS and Zookeeper using 12 anomaly detection models report consistent F1-score gains (average 2.18% for deep learning models on HDFS), with an unsupervised Transformer improving from 0.818 to 0.970 on HDFS; ablations indicate both static analysis and LLM verification are necessary.
Significance. If the generated sequences are distributionally faithful to real executions and correctly labeled, the work directly tackles the data sparsity problem the authors quantify (public benchmarks cover <10% of source-code log templates), potentially reducing false alarms on unseen valid paths. Strengths include the public release of the framework and datasets, the breadth of the 12-model evaluation, and the ablation results that isolate component contributions. These elements support reproducibility and could influence practical log anomaly detection pipelines.
major comments (2)
- [Evaluation] Evaluation section (performance claims): The headline result, an unsupervised Transformer F1-score rising from 0.818 to 0.970 on HDFS, is large enough to be load-bearing for the central claim. Yet the manuscript provides no direct evidence (manual inspection, distributional comparison to real traces, or parameter-validity checks) that LLM-synthesized values (block IDs, IP addresses, etc.) are executable or that heuristic labels are accurate. Without such validation, the gains could arise from models exploiting generation artifacts rather than learning improved coverage of valid paths.
- [Ablation study] Ablation study: While removing LCFG construction or LLM verification reduces F1 by up to 8.7 and 10.7 points respectively, the ablations do not include any metric of generated-data fidelity (e.g., fraction of paths that match real executions or statistical tests on parameter distributions). This omission leaves open whether the retained data truly augments coverage of unseen but valid paths or merely supplies easier-to-classify examples.
minor comments (4)
- The abstract states that the framework and datasets are publicly available; the main text should include an explicit repository URL and commit hash for reproducibility.
- [Methodology] Methodology section: Provide the exact domain heuristics used for labeling and the full LLM prompts (including CoT instructions) so readers can assess potential label noise or hallucination risks.
- [Evaluation] Evaluation section: Report statistical significance (e.g., paired t-tests or bootstrap confidence intervals) for the F1 improvements and include error bars on the per-model results.
- [Discussion] Discussion: Add a limitations paragraph addressing possible failure modes of LCFG construction (e.g., incomplete static analysis of complex control flow) and how they might affect downstream detection performance.
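The significance check requested in the Evaluation comment could take the form of a percentile bootstrap over paired per-model differences. The sketch below is one possible reading of that request, not the authors' procedure; the F1 values are made-up placeholders and the function name is hypothetical.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_mean_ci(diffs: List[float], n_resamples: int = 2000,
                      alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean paired difference."""
    if not diffs:
        raise ValueError("need at least one paired difference")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Made-up per-model F1 scores (baseline, augmented); not results from the paper.
baseline = [0.90, 0.85, 0.88, 0.92, 0.80]
augmented = [0.93, 0.86, 0.91, 0.93, 0.84]
diffs = [a - b for a, b in zip(augmented, baseline)]

ci_low, ci_high = bootstrap_mean_ci(diffs)
print(f"mean gain {mean(diffs):.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

An interval that excludes zero suggests the gain is not a resampling fluke across models. With only a handful of models the interval is coarse, so it complements per-model error bars rather than replacing them.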
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the need for stronger validation of the synthesized data. We address each major point below and propose revisions to incorporate direct fidelity checks.
- Referee: [Evaluation] Evaluation section (performance claims): The headline result, an unsupervised Transformer F1-score rising from 0.818 to 0.970 on HDFS, is large enough to be load-bearing for the central claim. Yet the manuscript provides no direct evidence (manual inspection, distributional comparison to real traces, or parameter-validity checks) that LLM-synthesized values (block IDs, IP addresses, etc.) are executable or that heuristic labels are accurate. Without such validation, the gains could arise from models exploiting generation artifacts rather than learning improved coverage of valid paths.
  Authors: We agree that direct validation of the generated sequences would provide stronger support for the headline claims. The current manuscript relies on indirect evidence: the large performance gains occur only when both LCFG construction and LLM verification are present, and the ablations show substantial drops (up to 10.7 points) when either is removed. This pattern is difficult to explain solely by artifacts, as random or invalid sequences would not systematically improve coverage of unseen valid paths. Nevertheless, we will add a new subsection in the revised manuscript that reports (1) statistical comparisons (e.g., Kolmogorov-Smirnov tests) of parameter distributions between generated and real traces, (2) the fraction of generated paths that match observed real executions, and (3) a manual audit of 200 randomly sampled sequences for executability and label correctness. These additions will directly address the concern that gains might stem from artifacts.
  revision: yes
- Referee: [Ablation study] Ablation study: While removing LCFG construction or LLM verification reduces F1 by up to 8.7 and 10.7 points respectively, the ablations do not include any metric of generated-data fidelity (e.g., fraction of paths that match real executions or statistical tests on parameter distributions). This omission leaves open whether the retained data truly augments coverage of unseen but valid paths or merely supplies easier-to-classify examples.
  Authors: We acknowledge that the existing ablation results measure only downstream F1 impact and do not quantify data fidelity. In the revision we will augment the ablation study with explicit fidelity metrics: the percentage of generated execution paths that appear in the original real traces, and distributional similarity tests on runtime parameters. These metrics will be reported for the full pipeline versus the ablated variants, allowing readers to assess whether the retained data improves coverage of valid paths rather than merely providing easier examples.
  revision: yes
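The two fidelity metrics named in the rebuttal can be sketched in a few lines. This is an illustrative reading rather than the authors' implementation: the function names are hypothetical, the sample values are placeholders, and the Kolmogorov-Smirnov code computes only the test statistic. A library routine such as SciPy's `ks_2samp` would be the usual way to also obtain a p-value.

```python
from bisect import bisect_right
from typing import List, Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # Both empirical CDFs are step functions, so checking every observed
    # value is enough to find the maximum difference.
    for x in set(a_sorted) | set(b_sorted):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def path_match_fraction(generated: List[List[str]],
                        real: List[List[str]]) -> float:
    """Share of generated event sequences that also occur in the real traces."""
    if not generated:
        raise ValueError("no generated sequences given")
    real_set = {tuple(seq) for seq in real}
    return sum(1 for seq in generated if tuple(seq) in real_set) / len(generated)


# Placeholder numeric parameter values (for example, sizes) from each source.
real_params = [1, 2, 3, 4]
generated_params = [3, 4, 5, 6]
ks = ks_statistic(real_params, generated_params)

# Placeholder event-ID sequences.
generated_paths = [["E1", "E2"], ["E1", "E3"], ["E1", "E4"], ["E1", "E2"]]
real_paths = [["E1", "E2"], ["E1", "E5"]]
match = path_match_fraction(generated_paths, real_paths)
print(f"KS statistic: {ks:.2f}, path match fraction: {match:.2f}")
```

A low match fraction is not a failure in itself, since the stated purpose of the augmentation is to cover valid paths that the real traces lack. The metric is most informative when compared across the full pipeline and the ablated variants, as the rebuttal proposes.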
Circularity Check
No significant circularity in derivation chain
The paper derives its performance gains from an external pipeline: LCFG construction from source code, LLM CoT parameter synthesis, heuristic labeling, and then training/evaluation on held-out portions of standard HDFS and Zookeeper benchmarks. Reported F1 improvements (including the 0.818→0.970 Transformer jump) and ablations are measured against fixed test sets that are not used in generation or labeling; no equation or step reduces the final metric to a fitted parameter or self-defined quantity. No self-citation load-bearing steps, imported uniqueness theorems, or ansatz smuggling appear. The approach is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
- Log-Oriented Control Flow Graphs (LCFGs): no independent evidence