
arxiv: 2604.06171 · v1 · submitted 2026-01-09 · 💻 cs.CL · cs.AI

LLM-Augmented Knowledge Base Construction For Root Cause Analysis

Pith reviewed 2026-05-16 15:43 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords root cause analysis · knowledge base construction · large language models · support tickets · network reliability · RAG · fine-tuning

The pith

Three LLM approaches construct a knowledge base from support tickets that accelerates root cause analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines three ways to use large language models to build a knowledge base for root cause analysis using data from support tickets in communications networks. The methods include fine-tuning the models, using retrieval-augmented generation, and a combination of both. Performance is measured with various lexical and semantic similarity scores compared to a reference knowledge base. Results from tests on real industrial data indicate that these generated bases offer a good foundation to speed up RCA and boost network reliability. A sympathetic reader would care because it addresses the challenge of maintaining high uptime in complex networks where quick diagnosis of problems is critical.
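To make the retrieval-augmented option concrete, here is a minimal sketch of the retrieval step only. It is not the authors' pipeline: the bag-of-words retriever, the prompt wording, the function names (`retrieve`, `build_prompt`), and the sample tickets are all assumptions made for illustration. A real system would likely use learned embeddings and send the resulting prompt to an LLM.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, tickets, k=2):
    """Return the k tickets most similar to the query (hypothetical retriever)."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(t))), t) for t in tickets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]


def build_prompt(query, tickets, k=2):
    """Assemble a prompt that places retrieved tickets ahead of the new ticket."""
    context = "\n".join(f"- {t}" for t in retrieve(query, tickets, k))
    return f"Past tickets:\n{context}\n\nNew ticket: {query}\nProposed RCA rule:"


# Made-up tickets for illustration only; not from the paper's dataset.
TICKETS = [
    "Link down on router after fiber cut between sites",
    "High CPU on core switch caused packet loss",
    "Billing portal login page times out",
]

prompt = build_prompt("Packet loss reported on core switch", TICKETS, k=1)
```

With these sample tickets, the core-switch ticket is retrieved because it shares the most terms with the query. Fine-tuning, by contrast, changes the model weights rather than the prompt, and how the paper's hybrid combines the two is not specified in the abstract.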

Core claim

The experiments demonstrate that fine-tuning, RAG, and hybrid LLM methodologies can produce a root cause analysis knowledge base from support tickets that closely matches reference structures according to similarity metrics and thus serves as an excellent starting point for accelerating RCA tasks and improving network resilience.

What carries the argument

Comparison of fine-tuning, retrieval-augmented generation (RAG), and hybrid LLM methods for constructing an RCA knowledge base evaluated by lexical and semantic similarity metrics.
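As a rough illustration of what a lexical similarity score measures, the sketch below computes a ROUGE-1-style unigram F1 between a generated entry and a reference entry. It is a simplified stand-in, not the official ROUGE package and not the paper's implementation (no stemming or stopword handling). The function name `unigram_f1` and both sample sentences are invented for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(candidate, reference):
    """ROUGE-1-style F1: clipped unigram overlap between two texts."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example entries; they name different root causes.
reference = "Packet loss on the core switch is caused by a faulty line card"
generated = "Packet loss on the core switch is caused by a faulty power supply"

score = unigram_f1(generated, reference)  # 11 of 13 tokens overlap
```

Here the two entries share 11 of 13 tokens, so the score is 11/13 (about 0.85) even though they blame different components. This is the kind of gap between surface similarity and diagnostic usefulness that the load-bearing premise has to bridge.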

Load-bearing premise

Lexical and semantic similarity metrics accurately reflect how useful the generated knowledge base will be for engineers conducting actual root cause analysis.

What would settle it

A user study where network engineers attempt to diagnose the same set of outages using both the LLM-generated knowledge base and a gold-standard reference, then compare success rates and time taken.

Figures

Figures reproduced from arXiv: 2604.06171 by Brigitte Jaumard, Karthikeyan Premkumar, Kun Ni, Nguyen Phuc Tran, Oscar Delgado, Tristan Glatard.

Figure 1: Distribution of token lengths in the dataset.
Figure 2: Fine-tuning phase with LoRA and quantization for
Figure 3: Inference pipeline for automated RCA rule generation
Figure 4: Results on the training of the Word2Vec model for the
Figure 5: Network issue distribution: Fine-tune vs. evaluation
Original abstract

Communications networks now form the backbone of our digital world, with fast and reliable connectivity. However, even with appropriate redundancy and failover mechanisms, it is difficult to guarantee "five 9s" (99.999 %) reliability, requiring rapid and accurate root cause analysis (RCA) during outages. In the event of an outage, rapid and accurate RCA becomes essential to restore service and prevent future disruptions. This study evaluates three Large Language Model (LLM) methodologies - Fine-Tuning, RAG, and a Hybrid approach - for constructing a Root Cause Analysis (RCA) Knowledge Base from support tickets. We compare their performance using a comprehensive suite of lexical and semantic similarity metrics. Our experiments on a real industrial dataset demonstrate that the generated knowledge base provides an excellent starting point for accelerating RCA tasks and improving network resilience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates three LLM-based methodologies—Fine-Tuning, RAG, and a Hybrid approach—for constructing a Root Cause Analysis (RCA) knowledge base from industrial support tickets in communications networks. Performance is assessed via a suite of lexical and semantic similarity metrics against a reference KB, with the conclusion that the generated KB supplies an excellent starting point for accelerating RCA tasks and improving network resilience.

Significance. If the proxy similarity metrics were shown to correlate with downstream RCA utility, the methods could reduce manual effort in building domain-specific KBs and support faster outage resolution. The work targets a high-stakes industrial problem, but the current evaluation provides only indirect evidence, limiting claims about resilience improvements.
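One way to test whether a proxy metric tracks downstream utility is a rank correlation between per-entry similarity scores and a task outcome such as time-to-diagnosis. The sketch below implements Spearman's rank correlation in plain Python. The numbers are placeholders invented for illustration, not results from the paper, and the choice of Spearman over other statistics is an assumption.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder numbers, invented for illustration; not results from the paper.
similarity = [0.91, 0.85, 0.78, 0.66, 0.52]   # per-entry similarity score
minutes_to_diagnose = [12, 15, 14, 30, 41]    # hypothetical engineer times

rho = spearman(similarity, minutes_to_diagnose)
```

A strongly negative `rho` on real data (higher similarity, shorter diagnosis time) would support using the similarity metrics as a proxy. A value near zero would suggest they say little about RCA utility.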

major comments (2)
  1. [Abstract] Abstract: the central claim that the generated KB provides an 'excellent starting point' for RCA and improves network resilience rests entirely on lexical/semantic similarity scores; no quantitative metric values, baseline comparisons, error bars, or validation that these scores predict actual RCA task performance (e.g., diagnostic accuracy or time-to-resolution) are reported.
  2. [Experiments] Experiments section: no direct RCA task evaluation is described, such as substituting the generated KB for the reference KB and measuring engineer time-to-diagnosis, diagnostic accuracy, or number of resolved tickets; lexical overlap and embedding similarity can be high while causal chains or actionability remain incomplete.
minor comments (2)
  1. [Methodology] The description of the Hybrid approach should specify exactly how the outputs of Fine-Tuning and RAG are combined and how any conflicts are resolved.
  2. [Evaluation Metrics] All similarity metrics (BLEU, ROUGE, cosine similarity, etc.) should be accompanied by explicit formulas, implementation details, and the precise reference KB construction process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the distinction between proxy metrics and direct task utility. We address each major comment below and outline targeted revisions to clarify the scope and limitations of our evaluation.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the generated KB provides an 'excellent starting point' for RCA and improves network resilience rests entirely on lexical/semantic similarity scores; no quantitative metric values, baseline comparisons, error bars, or validation that these scores predict actual RCA task performance (e.g., diagnostic accuracy or time-to-resolution) are reported.

    Authors: The Experiments section reports concrete metric values (e.g., BLEU, ROUGE, BERTScore, and embedding cosine similarities), method comparisons, and variability across runs. We agree the abstract phrasing is too strong given the indirect nature of the evidence. We will revise the abstract to summarize key similarity results, replace 'excellent' with 'promising', and add a clause noting that downstream RCA validation remains future work. This makes the quantitative grounding explicit without overstating implications.

    revision: partial

  2. Referee: [Experiments] Experiments section: no direct RCA task evaluation is described, such as substituting the generated KB for the reference KB and measuring engineer time-to-diagnosis, diagnostic accuracy, or number of resolved tickets; lexical overlap and embedding similarity can be high while causal chains or actionability remain incomplete.

    Authors: We concur that similarity alone does not guarantee causal completeness or actionability. Our design deliberately uses an expert-curated reference KB as ground truth to enable reproducible, automated assessment at scale. We will expand the Experiments and Limitations sections to explicitly state this proxy limitation, note that high lexical/semantic overlap does not ensure complete causal chains, and add a forward-looking paragraph on planned human-subject studies with network engineers to measure diagnostic accuracy and resolution time.

    revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential reductions


The paper conducts an empirical evaluation of three LLM-based methods (Fine-Tuning, RAG, Hybrid) for generating an RCA knowledge base from support tickets. Performance is measured directly via standard lexical (BLEU, ROUGE) and semantic similarity metrics against a reference KB. No equations, parameter fitting, predictions that reduce to fitted inputs, uniqueness theorems, or self-citation chains appear in the derivation. The central claim rests on observed metric values from an external industrial dataset rather than any construction that equates outputs to inputs by definition. This is a standard self-contained empirical study; the similarity-to-utility inference is an interpretive step, not a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The work implicitly assumes standard LLM capabilities for text processing and that similarity metrics proxy for RCA utility.

axioms (1)
  • domain assumption: Large language models can meaningfully process and structure information from support tickets into a usable knowledge base.
    This underpins all three methodologies evaluated in the abstract.


