LLM-Augmented Knowledge Base Construction For Root Cause Analysis
Pith reviewed 2026-05-16 15:43 UTC · model grok-4.3
The pith
Three LLM approaches construct a knowledge base from support tickets that accelerates root cause analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The experiments demonstrate that fine-tuning, RAG, and hybrid LLM methodologies can produce a root cause analysis knowledge base from support tickets that closely matches reference structures according to similarity metrics and thus serves as an excellent starting point for accelerating RCA tasks and improving network resilience.
What carries the argument
Comparison of fine-tuning, retrieval-augmented generation (RAG), and hybrid LLM methods for constructing an RCA knowledge base evaluated by lexical and semantic similarity metrics.
Load-bearing premise
Lexical and semantic similarity metrics accurately reflect how useful the generated knowledge base will be for engineers conducting actual root cause analysis.
What would settle it
A user study where network engineers attempt to diagnose the same set of outages using both the LLM-generated knowledge base and a gold-standard reference, then compare success rates and time taken.
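One way the results of such a study could be analysed is sketched below in Python. It assumes a paired design in which each engineer-outage pair is timed once with each knowledge base; the function names `success_rate` and `paired_sign_flip_pvalue` are illustrative and do not come from the paper. The exact sign-flip test enumerates all 2^n sign assignments, so it is only practical for small studies.

```python
from itertools import product


def success_rate(outcomes):
    """Fraction of diagnosis attempts that were correct (outcomes are booleans)."""
    return sum(outcomes) / len(outcomes)


def paired_sign_flip_pvalue(diffs):
    """Two-sided exact sign-flip test on paired differences.

    diffs[i] is (time with generated KB) - (time with reference KB) for pair i.
    Under the null hypothesis of no difference, each sign is equally likely,
    so we count how many sign assignments give a total at least as extreme
    as the observed one.
    """
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float inputs
            extreme += 1
    return extreme / total
```

A small p-value would suggest that diagnosis time differs between the two knowledge bases; success rates for the two arms can be compared side by side with `success_rate`.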
Original abstract
Communications networks now form the backbone of our digital world, with fast and reliable connectivity. However, even with appropriate redundancy and failover mechanisms, it is difficult to guarantee "five 9s" (99.999 %) reliability, requiring rapid and accurate root cause analysis (RCA) during outages. In the event of an outage, rapid and accurate RCA becomes essential to restore service and prevent future disruptions. This study evaluates three Large Language Model (LLM) methodologies - Fine-Tuning, RAG, and a Hybrid approach - for constructing a Root Cause Analysis (RCA) Knowledge Base from support tickets. We compare their performance using a comprehensive suite of lexical and semantic similarity metrics. Our experiments on a real industrial dataset demonstrate that the generated knowledge base provides an excellent starting point for accelerating RCA tasks and improving network resilience.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates three LLM-based methodologies—Fine-Tuning, RAG, and a Hybrid approach—for constructing a Root Cause Analysis (RCA) knowledge base from industrial support tickets in communications networks. Performance is assessed via a suite of lexical and semantic similarity metrics against a reference KB, with the conclusion that the generated KB supplies an excellent starting point for accelerating RCA tasks and improving network resilience.
Significance. If the proxy similarity metrics were shown to correlate with downstream RCA utility, the methods could reduce manual effort in building domain-specific KBs and support faster outage resolution. The work targets a high-stakes industrial problem, but the current evaluation provides only indirect evidence, limiting claims about resilience improvements.
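The correlation the referee asks for could be checked with a rank correlation between per-item similarity scores and a downstream outcome such as time-to-resolution. The sketch below implements Spearman's coefficient in plain Python. The helper names are illustrative, and any inputs would have to come from a study that the manuscript does not report.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Called as `spearman(similarity_scores, minutes_to_diagnose)`, a strongly negative value would support the use of similarity as a proxy, while a value near zero would undercut it.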
major comments (2)
- [Abstract] Abstract: the central claim that the generated KB provides an 'excellent starting point' for RCA and improves network resilience rests entirely on lexical/semantic similarity scores; no quantitative metric values, baseline comparisons, error bars, or validation that these scores predict actual RCA task performance (e.g., diagnostic accuracy or time-to-resolution) are reported.
- [Experiments] Experiments section: no direct RCA task evaluation is described, such as substituting the generated KB for the reference KB and measuring engineer time-to-diagnosis, diagnostic accuracy, or number of resolved tickets; lexical overlap and embedding similarity can be high while causal chains or actionability remain incomplete.
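The last point can be made concrete with a toy pair of sentences. In this sketch (hypothetical sentences and helper names, whitespace tokenization assumed), a unigram-overlap score rates a generated entry as identical to the reference even though cause and effect are swapped.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 over clipped unigram counts; ignores word order entirely."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def causal_pair(sentence, marker=" caused "):
    """Hypothetical, deliberately naive split of '<cause> caused <effect>'."""
    cause, _, effect = sentence.lower().partition(marker)
    return cause.strip(), effect.strip()


reference = "fiber cut caused link failure"
generated = "link failure caused fiber cut"

print(unigram_f1(generated, reference))                   # 1.0
print(causal_pair(generated) == causal_pair(reference))   # False
```

The lexical score is perfect while the extracted causal direction is reversed, which is the failure mode a task-level evaluation would need to catch.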
minor comments (2)
- [Methodology] The description of the Hybrid approach should specify exactly how the outputs of Fine-Tuning and RAG are combined and how any conflicts are resolved.
- [Evaluation Metrics] All similarity metrics (BLEU, ROUGE, cosine similarity, etc.) should be accompanied by explicit formulas, implementation details, and the precise reference KB construction process.
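As an indication of the level of detail requested, the sketch below spells out one common formulation of ROUGE-N (an F1 over clipped n-gram counts) and the cosine similarity between two embedding vectors. Whitespace tokenization and the function names are assumptions; the manuscript's actual implementation is not described here and may differ.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Multiset of word n-grams, using lowercase whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """ROUGE-N reported as F1: harmonic mean of n-gram precision and recall."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped matching n-gram count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| |v|); returns 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

For semantic metrics, `u` and `v` would be sentence embeddings from whichever encoder the authors used; stating that encoder explicitly is part of the implementation detail the comment asks for.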
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the distinction between proxy metrics and direct task utility. We address each major comment below and outline targeted revisions to clarify the scope and limitations of our evaluation.
- Referee: [Abstract] Abstract: the central claim that the generated KB provides an 'excellent starting point' for RCA and improves network resilience rests entirely on lexical/semantic similarity scores; no quantitative metric values, baseline comparisons, error bars, or validation that these scores predict actual RCA task performance (e.g., diagnostic accuracy or time-to-resolution) are reported.
  Authors: The Experiments section reports concrete metric values (e.g., BLEU, ROUGE, BERTScore, and embedding cosine similarities), method comparisons, and variability across runs. We agree the abstract phrasing is too strong given the indirect nature of the evidence. We will revise the abstract to summarize key similarity results, replace 'excellent' with 'promising', and add a clause noting that downstream RCA validation remains future work. This makes the quantitative grounding explicit without overstating implications.
  Revision: partial
- Referee: [Experiments] Experiments section: no direct RCA task evaluation is described, such as substituting the generated KB for the reference KB and measuring engineer time-to-diagnosis, diagnostic accuracy, or number of resolved tickets; lexical overlap and embedding similarity can be high while causal chains or actionability remain incomplete.
  Authors: We concur that similarity alone does not guarantee causal completeness or actionability. Our design deliberately uses an expert-curated reference KB as ground truth to enable reproducible, automated assessment at scale. We will expand the Experiments and Limitations sections to explicitly state this proxy limitation, note that high lexical/semantic overlap does not ensure complete causal chains, and add a forward-looking paragraph on planned human-subject studies with network engineers to measure diagnostic accuracy and resolution time.
  Revision: partial
Circularity Check
No circularity: purely empirical comparison with no derivations or self-referential reductions
The paper conducts an empirical evaluation of three LLM-based methods (Fine-Tuning, RAG, Hybrid) for generating an RCA knowledge base from support tickets. Performance is measured directly via standard lexical (BLEU, ROUGE) and semantic similarity metrics against a reference KB. No equations, parameter fitting, predictions that reduce to fitted inputs, uniqueness theorems, or self-citation chains appear in the derivation. The central claim rests on observed metric values from an external industrial dataset rather than any construction that equates outputs to inputs by definition. This is a standard self-contained empirical study; the similarity-to-utility inference is an interpretive step, not a circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can meaningfully process and structure information from support tickets into a usable knowledge base.
Nguyen Phuc Tran received his M.S. degree in Computer Science from the University of Information Technology, Vietnam National University, Ho Chi Minh City, in 2020. Since 2021, he has been pursuing ...