LLM-Augmented Knowledge Base Construction For Root Cause Analysis

Brigitte Jaumard; Karthikeyan Premkumar; Kun Ni; Nguyen Phuc Tran; Oscar Delgado; Tristan Glatard

arxiv: 2604.06171 · v1 · submitted 2026-01-09 · 💻 cs.CL · cs.AI

Nguyen Phuc Tran, Brigitte Jaumard, Oscar Delgado, Tristan Glatard, Karthikeyan Premkumar, Kun Ni

Pith reviewed 2026-05-16 15:43 UTC · model grok-4.3

keywords: root cause analysis, knowledge base construction, large language models, support tickets, network reliability, RAG, fine-tuning

The pith

Three LLM approaches construct a knowledge base from support tickets that accelerates root cause analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines three ways to use large language models to build a knowledge base for root cause analysis using data from support tickets in communications networks. The methods include fine-tuning the models, using retrieval-augmented generation, and a combination of both. Performance is measured with various lexical and semantic similarity scores compared to a reference knowledge base. Results from tests on real industrial data indicate that these generated bases offer a good foundation to speed up RCA and boost network reliability. A sympathetic reader would care because it addresses the challenge of maintaining high uptime in complex networks where quick diagnosis of problems is critical.

Core claim

The experiments demonstrate that fine-tuning, RAG, and hybrid LLM methodologies can produce a root cause analysis knowledge base from support tickets that closely matches reference structures according to similarity metrics and thus serves as an excellent starting point for accelerating RCA tasks and improving network resilience.

What carries the argument

Comparison of fine-tuning, retrieval-augmented generation (RAG), and hybrid LLM methods for constructing an RCA knowledge base evaluated by lexical and semantic similarity metrics.

Load-bearing premise

Lexical and semantic similarity metrics accurately reflect how useful the generated knowledge base will be for engineers conducting actual root cause analysis.

What would settle it

A user study where network engineers attempt to diagnose the same set of outages using both the LLM-generated knowledge base and a gold-standard reference, then compare success rates and time taken.

Figures

Figures reproduced from arXiv: 2604.06171 by Brigitte Jaumard, Karthikeyan Premkumar, Kun Ni, Nguyen Phuc Tran, Oscar Delgado, Tristan Glatard.

**Figure 2.** Fine-tuning phase with LoRA and quantization for …

**Figure 3.** Inference pipeline for automated RCA rule generation

**Figure 4.** Results on the training of the Word2Vec model for the …

**Figure 5.** Network issue distribution: Fine-tune vs. evaluation

Original abstract

Communications networks now form the backbone of our digital world, with fast and reliable connectivity. However, even with appropriate redundancy and failover mechanisms, it is difficult to guarantee "five 9s" (99.999 %) reliability, requiring rapid and accurate root cause analysis (RCA) during outages. In the event of an outage, rapid and accurate RCA becomes essential to restore service and prevent future disruptions. This study evaluates three Large Language Model (LLM) methodologies - Fine-Tuning, RAG, and a Hybrid approach - for constructing a Root Cause Analysis (RCA) Knowledge Base from support tickets. We compare their performance using a comprehensive suite of lexical and semantic similarity metrics. Our experiments on a real industrial dataset demonstrate that the generated knowledge base provides an excellent starting point for accelerating RCA tasks and improving network resilience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies standard LLM methods to build an RCA knowledge base from tickets on real data, but its claim of usefulness rests only on unvalidated similarity scores.

The core of this work is a direct comparison of fine-tuning, RAG, and a hybrid LLM setup for turning support tickets into a knowledge base aimed at root cause analysis in communications networks. The authors run the three approaches on an industrial dataset and score the outputs against a reference KB using lexical and semantic similarity measures. That is the new piece: a side-by-side empirical test on this narrow task with actual company data rather than synthetic examples. The setup itself is clean and the choice of metrics is reasonable for an initial check. Credit is due for grounding the experiments in a real outage-recovery scenario instead of abstract benchmarks.

The limitation is straightforward. The abstract asserts that the generated KB supplies an excellent starting point for accelerating RCA and improving resilience, yet the only evidence offered is similarity to the reference. No numbers are given, no baselines are compared, and there is no test of whether engineers actually diagnose faster or more accurately when they use the generated KB instead of the original. Lexical and embedding overlap can be high even when the generated KB misses critical causal chains or action steps. That gap makes the main claim hard to assess from what is shown.

This paper is for applied researchers or engineers working on network operations who want to see how off-the-shelf LLM techniques behave on ticket data. A reader looking for a new method or a proven performance lift will not find it here. It is worth a serious referee if the authors add a direct RCA task evaluation or at least report the actual metric values with error bars and baselines. Without that, it stays at the level of a promising pilot.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates three LLM-based methodologies—Fine-Tuning, RAG, and a Hybrid approach—for constructing a Root Cause Analysis (RCA) knowledge base from industrial support tickets in communications networks. Performance is assessed via a suite of lexical and semantic similarity metrics against a reference KB, with the conclusion that the generated KB supplies an excellent starting point for accelerating RCA tasks and improving network resilience.

Significance. If the proxy similarity metrics were shown to correlate with downstream RCA utility, the methods could reduce manual effort in building domain-specific KBs and support faster outage resolution. The work targets a high-stakes industrial problem, but the current evaluation provides only indirect evidence, limiting claims about resilience improvements.

major comments (2)

[Abstract] Abstract: the central claim that the generated KB provides an 'excellent starting point' for RCA and improves network resilience rests entirely on lexical/semantic similarity scores; no quantitative metric values, baseline comparisons, error bars, or validation that these scores predict actual RCA task performance (e.g., diagnostic accuracy or time-to-resolution) are reported.
[Experiments] Experiments section: no direct RCA task evaluation is described, such as substituting the generated KB for the reference KB and measuring engineer time-to-diagnosis, diagnostic accuracy, or number of resolved tickets; lexical overlap and embedding similarity can be high while causal chains or actionability remain incomplete.
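
The concern in the second comment can be illustrated with a toy case. The sketch below is not taken from the paper: the two rule strings and the `tokens` and `token_jaccard` helpers are hypothetical, and Jaccard overlap over word sets merely stands in for the lexical metrics under discussion. It shows two rules built from exactly the same words that state opposite causal directions.

```python
import re

def tokens(text: str) -> set:
    """Lowercased word set of a rule (hypothetical helper, not from the paper)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap |A & B| / |A | B| between the word sets of two rules."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty rules are treated as identical
    return len(ta & tb) / len(ta | tb)

# Illustrative rules only; these are assumptions, not entries from the dataset.
reference_rule = "fiber cut causes link failure"
generated_rule = "link failure causes fiber cut"

score = token_jaccard(reference_rule, generated_rule)
print(score)  # the word sets are identical, although cause and effect are swapped
```

An order-insensitive score cannot tell these two rules apart, which is why a task-level evaluation would be needed to confirm that the causal content survives.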

minor comments (2)

[Methodology] The description of the Hybrid approach should specify exactly how the outputs of Fine-Tuning and RAG are combined and how any conflicts are resolved.
[Evaluation Metrics] All similarity metrics (BLEU, ROUGE, cosine similarity, etc.) should be accompanied by explicit formulas, implementation details, and the precise reference KB construction process.
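
As a rough indication of the level of detail this comment asks for, the following is a minimal sketch of one lexical and one semantic score. It assumes whitespace tokenization and plain Python lists as embedding vectors; the function names are hypothetical, and the paper's actual implementations and embedding models are not known from this page.

```python
import math
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(u: list, v: list) -> float:
    """cos(u, v) = (u . v) / (||u|| * ||v||); returns 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

In practice the vectors passed to `cosine_similarity` would come from an embedding model, which is out of scope for this sketch; stating the tokenizer, the clipping rule, and the model would make reported scores reproducible.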

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the distinction between proxy metrics and direct task utility. We address each major comment below and outline targeted revisions to clarify the scope and limitations of our evaluation.

Referee: [Abstract] Abstract: the central claim that the generated KB provides an 'excellent starting point' for RCA and improves network resilience rests entirely on lexical/semantic similarity scores; no quantitative metric values, baseline comparisons, error bars, or validation that these scores predict actual RCA task performance (e.g., diagnostic accuracy or time-to-resolution) are reported.

Authors: The Experiments section reports concrete metric values (e.g., BLEU, ROUGE, BERTScore, and embedding cosine similarities), method comparisons, and variability across runs. We agree the abstract phrasing is too strong given the indirect nature of the evidence. We will revise the abstract to summarize key similarity results, replace 'excellent' with 'promising', and add a clause noting that downstream RCA validation remains future work. This makes the quantitative grounding explicit without overstating implications.

revision: partial

Referee: [Experiments] Experiments section: no direct RCA task evaluation is described, such as substituting the generated KB for the reference KB and measuring engineer time-to-diagnosis, diagnostic accuracy, or number of resolved tickets; lexical overlap and embedding similarity can be high while causal chains or actionability remain incomplete.

Authors: We concur that similarity alone does not guarantee causal completeness or actionability. Our design deliberately uses an expert-curated reference KB as ground truth to enable reproducible, automated assessment at scale. We will expand the Experiments and Limitations sections to explicitly state this proxy limitation, note that high lexical/semantic overlap does not ensure complete causal chains, and add a forward-looking paragraph on planned human-subject studies with network engineers to measure diagnostic accuracy and resolution time.

revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential reductions

The paper conducts an empirical evaluation of three LLM-based methods (Fine-Tuning, RAG, Hybrid) for generating an RCA knowledge base from support tickets. Performance is measured directly via standard lexical (BLEU, ROUGE) and semantic similarity metrics against a reference KB. No equations, parameter fitting, predictions that reduce to fitted inputs, uniqueness theorems, or self-citation chains appear in the derivation. The central claim rests on observed metric values from an external industrial dataset rather than any construction that equates outputs to inputs by definition. This is a standard self-contained empirical study; the similarity-to-utility inference is an interpretive step, not a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The work implicitly assumes standard LLM capabilities for text processing and that similarity metrics proxy for RCA utility.

axioms (1)

domain assumption: Large language models can meaningfully process and structure information from support tickets into a usable knowledge base.
This underpins all three methodologies evaluated in the abstract.


[1] [1]

Root cause analysis in 5G/6G networks,

D. Canastro, R. Rocha, M. Antunes, D. Gomes, and R. Aguiar, “Root cause analysis in 5G/6G networks,” inInternational Conference on Future Internet of Things and Cloud (FiCloud), 2021, pp. 217–224

work page 2021

[2] [2]

A survey on large language model based autonomous agents,

L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y . Lin, W. Zhao, Z. Wei, and J. Wen, “A survey on large language model based autonomous agents,”Frontiers of Computer Science, vol. 18, no. 6, pp. 1–26, 2024

work page 2024

[3] [3]

Privacy in (mobile) telecommunications services,

J. Penders, “Privacy in (mobile) telecommunications services,”Ethics and Information Technology, vol. 6, pp. 247–260, 2004

work page 2004

[4] [4]

A survey on security and privacy of 5G technologies: Potential solutions, recent advancements, and future directions,

R. Khan, P. Kumar, D. N. K. Jayakody, and M. Liyanage, “A survey on security and privacy of 5G technologies: Potential solutions, recent advancements, and future directions,”IEEE Communications Surveys & Tutorials, vol. 22, no. 1, pp. 196–248, 2020

work page 2020

[5] [5]

A survey on large language model (LLM) security and privacy: the good, the bad, and the ugly,

Y . Yao, J. Duan, K. Xu, Y . Cai, Z. Sun, and Y . Zhang, “A survey on large language model (LLM) security and privacy: the good, the bad, and the ugly,”High-Confidence Computing, vol. 4, no. 2, p. 100211, 2024

work page 2024

[6] [6]

AI empowered net-RCA for 6G,

C. Qiu, K. Yang, J. Wang, and S. Zhao, “AI empowered net-RCA for 6G,”IEEE Network, vol. 37, no. 6, pp. 132–140, 2023

work page 2023

[7] [7]

A method for root cause analysis with a Bayesian belief network and fuzzy cognitive map,

Y . Y . Wee, W. P. Cheah, S. C. Tan, and K. Wee, “A method for root cause analysis with a Bayesian belief network and fuzzy cognitive map,” Expert Systems with Applications, vol. 42, no. 1, pp. 468–487, 2015

work page 2015

[8] [8]

Service Outages Prediction through Logs and Tickets Analysis,

S. Yadwad, V . Valli, and S. S. B. Venkata, “Service Outages Prediction through Logs and Tickets Analysis,”Int. Journal of Advanced Computer Science and Applications, vol. 12, no. 4, pp. 177 – 183, 2021

work page 2021

[9] [9]

An automatic detection and diagnosis framework for mobile communication systems,

P. Szilágyi and S. Nováczki, “An automatic detection and diagnosis framework for mobile communication systems,”IEEE transactions on Network and Service Management, vol. 9, no. 2, pp. 184–197, 2012

work page 2012

[10] [10]

An improved anomaly detection and diagnosis framework for mobile network operators,

S. Nováczki, “An improved anomaly detection and diagnosis framework for mobile network operators,” in9th Int. conference on the design of reliable communication networks (DRCN). IEEE, 2013, pp. 234–241

work page 2013

[11] [11]

Root Cause Analysis in 5G/6G Networks,

D. Canastro, R. Rocha, M. Antunes, D. Gomes, and R. L. Aguiar, “Root Cause Analysis in 5G/6G Networks,” in8th Int. Conference on Future Internet of Things and Cloud (FiCloud), 2021, pp. 217–224. PHUC ET AL.: LLM-AUGMENTED KNOWLEDGE BASE CONSTRUCTION FOR ROOT CAUSE ANALYSIS 13

work page 2021

[12] [12]

Unsupervised detection of microservice trace anomalies through service-level deep bayesian networks,

P. Liu, H. Xu, Q. Ouyang, R. Jiao, Z. Chen, S. Zhang, J. Yang, L. Mo, J. Zeng, W. Xue, and D. Pei, “Unsupervised detection of microservice trace anomalies through service-level deep bayesian networks,” inInter- national Symposium on Software Reliability Engineering (ISSRE), 2020, pp. 48–58

work page 2020

[13] [13]

A scalable multi-factor fault analysis framework for information systems,

H.-H. Phan-Vu, B. Jaumard, T. Glatard, J. Whatley, and S. Nadeau, “A scalable multi-factor fault analysis framework for information systems,” inIEEE Int. Conference on Big Data (Big Data), 2021, pp. 2621–2630

work page 2021

[14] [14]

Groot: An event-graph-based approach for root cause analysis in in- dustrial settings,

H. H. Wang, Z. Wu, H. Jiang, Y . Huang, J. Wang, S. Kopru, and T. Xie, “Groot: An event-graph-based approach for root cause analysis in in- dustrial settings,” inIEEE/ACM International Conference on Automated Software Engineering (ASE), 2021, pp. 419–429

work page 2021

[15] [15]

Recommending root-cause and mitigation steps for cloud incidents using large language models,

T. Ahmed, S. Ghosh, C. Bansal, T. Zimmermann, X. Zhang, and S. Rajmohan, “Recommending root-cause and mitigation steps for cloud incidents using large language models,” inIEEE/ACM Int. Conference on Software Engineering (ICSE), 2023, pp. 1737–1749

work page 2023

[16] [16]

Assess and summarize: Improve outage understanding with large language models,

P. Jin, S. Zhang, M. Ma, H. Li, Y . Kang, L. Li, Y . Liu, B. Qiao, C. Zhang, P. Zhaoet al., “Assess and summarize: Improve outage understanding with large language models,” in31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 1657–1668

work page 2023

[17] [17]

Retrieval- augmented generation for knowledge-intensive nlp tasks,

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschelet al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,”Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020

work page 2020

[18] [18]

Automatic root cause analysis via large language models for cloud incidents,

Y . Chen, H. Xie, M. Ma, Y . Kang, X. Gao, L. Shi, Y . Cao, X. Gao, H. Fan, M. Wen, J. Zeng, S. Ghosh, X. Zhang, Q. Lin, S. Rajmohan, and D. Zhang, “Automatic root cause analysis via large language models for cloud incidents,”EuroSys’24, 2024

work page 2024

[19] [19]

Pace: Prompting and augmentation for calibrated confidence estimation with gpt-4 in cloud incident root cause anal- ysis

D. Zhang, X. Zhang, C. Bansal, P. Las-Casas, R. Fonseca, and S. Raj- mohan, “PACE-LM: Prompting and Augmentation for Calibrated Con- fidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis,” Microsoft, vol. abs/2309.05833, 2023

work page arXiv 2023

[20] [20]

Large language models (llms): Hypes and realities,

S. K. Routray, A. Javali, K. P. Sharmila, M. K. Jha, M. Pappa, and M. Singh, “Large language models (llms): Hypes and realities,” in 2023 International Conference on Computer Science and Emerging Technologies (CSET), 2023, pp. 1–6

work page 2023

[21] [21]

To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis,

F. Xue, Y . Fu, W. Zhou, Z. Zheng, and Y . You, “To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis,” inAdvances in Neural Information Processing Systems, vol. 36. Curran Associates, Inc., 2023, pp. 59 304–59 322

work page 2023

[22] [22]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosaleet al., “Llama 2: Open foundation and fine-tuned chat models,”GenAI, Meta, vol. abs/2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] V. Sachidananda, J. S. Kessler, and Y.-A. Lai, “Efficient domain adaptation of language models via adaptive tokenization,” in EMNLP 2021 Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), 2021.

[24] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=nZeVKeeFYf9

[25] X. Chen, I. Beaver, and C. Freeman, “Fine-Tuning Language Models For Semi-Supervised Text Mining,” in IEEE Int. Conference on Big Data (Big Data), 2020, pp. 3608–3617.

[26] J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “Awq: Activation-aware weight quantization for on-device llm compression and acceleration,” Proceedings of Machine Learning and Systems, vol. 6, pp. 87–100, 2024.

[27] Y. Xia, J. Kim, Y. Chen, H. Ye, S. Kundu, C. C. Hao, and N. Talati, “Understanding the performance and estimating the cost of llm fine-tuning,” in 2024 IEEE Int. Symposium on Workload Characterization (IISWC), 2024, pp. 210–223.

[28] Y. Ahn, S.-G. Lee, J. Shim, and J. Park, “Retrieval-Augmented Response Generation for Knowledge-Grounded Conversation in the Wild,” IEEE Access, vol. 10, pp. 131 374–131 385, 2022.

[29] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.

[30] S. Banerjee and A. Lavie, “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments,” in ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.

[31] C.-Y. Lin, “ROUGE: A Package for Automatic Evaluation of Summaries,” in Text summarization branches out. ACL, 2004, pp. 74–81.

[32] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating Text Generation with BERT,” in Int. Conference on Learning Representations (ICLR), 2020.

[33] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

[34] T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love et al., “Gemma: Open models based on gemini research and technology,” Google DeepMind, vol. abs/2403.08295, 2024.

[35] Meta, “Introducing meta llama 3: The most capable openly available llm to date,” Meta AI, 2024.

[36] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., “Mistral 7b,” Stanford CRFM, 2023.

[37] Microsoft-GenAI, “Introducing phi-3: Redefining what’s possible with slms,” Microsoft GenAI, 2024.

[38] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, É. Goffinet, D. Hesslow, J. Launay, Q. Malartic et al., “The falcon series of open language models,” Technology Innovation Institute, 2023.

[39] Y. Wang, Z. Yu, Z. Zeng, L. Yang, C. Wang, H. Chen, C. Jiang, R. Xie, J. Wang, X. Xie, W. Ye, S. Zhang, and Y. Zhang, “PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization,” in International Conference on Learning Representations (ICLR), 2024, pp. 1–21.

[40] T. Wang, G. Qi, and T. Wu, “Kgroot: A knowledge graph-enhanced method for root cause analysis,” Expert Systems with Applications, vol. 255, p. 124679, 2024.

Nguyen Phuc Tran received his M.S. degree in Computer Science from the University of Information Technology, Vietnam National University, Ho Chi Minh City, in 2020. Since 2021, he has been pursuing ...