pith. sign in

arxiv: 2605.19180 · v1 · pith:46MMHENPnew · submitted 2026-05-18 · 💻 cs.SE

Supporting System Testing with a Multi-Agent LLM-based Framework for Knowledge Graph Extraction: A Case Study with Ethernet Switch Systems

Pith reviewed 2026-05-20 08:28 UTC · model grok-4.3

classification 💻 cs.SE
keywords knowledge graph extractionmulti-agent LLMsystem testingEthernet switch configurationtest case generationconfiguration manualsiterative refinement
0
0 comments X

The pith

A multi-agent LLM framework extracts knowledge graphs from Ethernet switch manuals at 0.97-0.99 correctness to support test automation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that uses multiple LLM agents to extract structured knowledge graphs from Ethernet switch configuration manuals, which are typically semi-structured with hidden details and linked sections. A detailed schema defines the graph structure while an iterative extract-evaluate-improve loop refines the output until it meets quality checks. Evaluation across fifty real manuals shows the process yields high correctness scores, aligns closely with human reviewers, and produces graphs that industry testers find useful for building test cases.

Core claim

The multi-agent LLM-based framework extracts knowledge graphs from Ethernet switch configuration manuals using a fine-grained KG schema and an iterative Extract-Evaluate-Improve loop. On 50 real-world manuals the original prompts deliver average correctness scores of 0.97 to 0.99 across three extraction tasks, with further gains on difficult manuals through targeted prompt refinement. LLM judgments agree with human evaluations at Cohen's kappa of at least 0.72, and the resulting graphs support generation of useful and correct test case specifications for downstream system testing.

What carries the argument

Multi-agent LLM framework with fine-grained knowledge graph schema and iterative Extract-Evaluate-Improve loop that structures configuration knowledge from semi-structured technical manuals.

If this is right

  • The generated knowledge graphs enable direct creation of test case specifications for Ethernet switch systems.
  • Targeted prompt refinement inside the EEI loop raises correctness on the most complex manuals.
  • Substantial agreement between LLM and human evaluations indicates the extraction outputs are reliable enough for industrial use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same multi-agent and iterative approach could convert other semi-structured industrial documents into machine-readable form for testing or analysis.
  • Embedding the resulting graphs into existing test-generation tools would further reduce manual effort in hardware and software validation workflows.
  • Full automation of the refinement step could eventually remove the need for human prompt adjustments on most documents.

Load-bearing premise

The fine-grained KG schema and iterative EEI loop are assumed to capture all implicit step attributes and complex section dependencies present in the manuals without systematic omission of critical configuration knowledge.

What would settle it

A new collection of Ethernet switch manuals in which the extracted graphs omit important configuration parameters or dependencies and thereby produce invalid test cases would show the framework fails to capture the full knowledge.

Figures

Figures reproduced from arXiv: 2605.19180 by Haiwei Dong, Hussein Al Osman, Jean Baptiste Minani, Lionel Briand, Mahboubeh Dadkhah, Rongqi Pan.

Figure 1
Figure 1. Figure 1: Overview of the proposed knowledge graph (KG) schema. [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: An overview of our multi-agent framework for knowledge graph extraction, evaluation, and improvement. [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: An example of the Configuration Roadmap section in the ESCM. stated goal and any associated notes, if present. Capturing these elements allows the KG to represent not only the sequence of configuration actions, but also the configuration intent and constraints that help interpret the purpose of each step and support downstream reasoning. Overview: You are a networking configuration assistant. Given a roadm… view at source ↗
Figure 4
Figure 4. Figure 4: The prompt for roadmap extraction agent. [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Example roadmap extraction output and the corresponding KG triples constructed from the output. [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: An example illustrating the mapping between main steps in the [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The prompt for roadmap–procedure mapping Agent. [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Example roadmap–procedure mapping output and the corresponding KG triples constructed from the output. [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: An example of the Procedure section in the ESCM. linguistic variation. We therefore included three representative examples in the final prompt to better capture these variations and guide the model toward more accurate extraction. • <Procedure Main Step>: This section provides the original procedure main step text from the ESCM for extraction. • <Response>: This section constrains the output to a predefine… view at source ↗
Figure 10
Figure 10. Figure 10: The prompt for procedure extraction agent. [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Example procedure extraction output and the corresponding KG triples constructed from the output. [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: The prompt for the EvalAgent. • <Inputs> This section provides the source text for each extraction task: Roadmap extraction uses the Configuration Roadmap section text, roadmap–procedure mapping uses both the Configuration Roadmap and Procedure section texts, and procedure extraction uses the Procedure section text, together with the extracted entities for that task. • <Response> This section defines the … view at source ↗
Figure 13
Figure 13. Figure 13: The prompt for the ImprovAgent. - note: any additional clarifications, explanations, conditions, or background text that is not itself the main action/goal. - note: only extract clarifications, explanations, conditions, examples, or background text that are not actions, requirements, scope/targets, or configuration attributes. - Do not extract as notes any phrases that specify where/what is being configur… view at source ↗
Figure 14
Figure 14. Figure 14: Example of the revised prompt. 3.2.4 Knowledge Graph Enhancement. According to the KG schema shown in [PITH_FULL_IMAGE:figures/full_fig_p025_14.png] view at source ↗
read the original abstract

Technical documents contain rich domain knowledge for automating downstream tasks such as system testing. While this paper focuses on Ethernet switch configuration manuals (ESCMs), we propose a general framework that can be adapted to different industrial contexts. ESCMs provide valuable domain knowledge for Ethernet switch testing, but their semi-structured format, implicit step attributes, and complex section dependencies make them difficult to directly leverage for test automation. To address this, we generate knowledge graphs (KGs) that capture configuration knowledge from ESCM in a structured form. We propose a multi-agent LLM-based framework that extracts, evaluates, and improves KGs from ESCMs using a fine-grained KG schema and an iterative Extract-Evaluate-Improve (EEI) loop. Our evaluation on 50 real-world ESCMs shows that our framework achieves high extraction correctness using the original prompts, with average correctness scores ranging from 0.97 to 0.99 across three extraction tasks. For challenging ESCMs, the EEI loop further improves correctness through manual-specific prompt refinement. Moreover, the LLM judgments and human evaluations show substantial agreement, with Cohen's kappa of at least 0.72 across all extraction tasks. Finally, feedback from industry testers indicates that the generated KGs can support the generation of useful and correct test case specifications (TCSs) for downstream testing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes a multi-agent LLM-based framework for extracting knowledge graphs from semi-structured Ethernet switch configuration manuals (ESCMs) to support system testing. It introduces a fine-grained KG schema and an iterative Extract-Evaluate-Improve (EEI) loop, evaluates the approach on 50 real-world ESCMs reporting average correctness scores of 0.97–0.99 across three extraction tasks, documents Cohen’s kappa ≥ 0.72 agreement with human evaluators, and includes positive feedback from industry testers on utility for generating test case specifications.

Significance. If the extraction results hold, the work offers a practical contribution to automating knowledge capture from industrial technical documents for downstream testing tasks. Strengths include the use of 50 real documents, quantitative correctness metrics, human-LLM agreement statistics, and direct industry feedback. These elements provide empirical grounding for the extraction fidelity claims and suggest applicability beyond the Ethernet switch case study.

major comments (1)
  1. [§4] §4 (KG Schema Definition and EEI Loop): The evaluation reports high correctness (0.97–0.99) and human agreement (κ ≥ 0.72) but measures fidelity to the chosen schema rather than schema completeness. No independent audit is described that checks whether recurring implicit elements—such as conditional configuration rules, cross-section references, or vendor-specific step attributes—are systematically omitted from the fine-grained schema. This assumption is load-bearing for the claim that the resulting KGs fully support generation of correct and complete test case specifications.
minor comments (2)
  1. [Abstract] Abstract: The three extraction tasks are referenced but not named or briefly characterized, which would improve immediate readability for readers unfamiliar with the domain.
  2. [§5] §5 (Evaluation): Table or figure captions for the correctness scores and kappa values could explicitly state the number of documents and evaluators to allow quick assessment of statistical robustness.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and describe the revisions we will make to strengthen the presentation of our approach.

read point-by-point responses
  1. Referee: [§4] §4 (KG Schema Definition and EEI Loop): The evaluation reports high correctness (0.97–0.99) and human agreement (κ ≥ 0.72) but measures fidelity to the chosen schema rather than schema completeness. No independent audit is described that checks whether recurring implicit elements—such as conditional configuration rules, cross-section references, or vendor-specific step attributes—are systematically omitted from the fine-grained schema. This assumption is load-bearing for the claim that the resulting KGs fully support generation of correct and complete test case specifications.

    Authors: We appreciate the referee’s distinction between extraction fidelity to a given schema and independent verification of schema completeness. The fine-grained KG schema was constructed from an initial analysis of representative ESCMs together with domain knowledge provided by our industry collaborators, with the explicit goal of supporting downstream test-case specification generation. The EEI loop and manual-specific prompt refinements were introduced precisely to improve extraction quality within that schema. We acknowledge that the reported metrics (correctness 0.97–0.99 and κ ≥ 0.72) evaluate how faithfully the extracted graphs conform to the chosen schema rather than whether every possible implicit element (conditional rules, cross-references, vendor-specific attributes) has been exhaustively captured. At the same time, the positive feedback from industry testers on the utility of the resulting KGs for producing correct and complete test cases offers pragmatic evidence that the schema is adequate for the intended use. In the revised manuscript we will (i) expand §4 with a clearer description of the schema-design process and its grounding in domain input, (ii) explicitly state that correctness is measured relative to the schema, and (iii) add a limitations paragraph discussing potential omissions and how the framework can be extended if additional elements are later identified. revision: yes

Circularity Check

0 steps flagged

No significant circularity; evaluation relies on external human judgments and industry feedback

full rationale

The paper presents a multi-agent LLM framework for KG extraction from ESCMs, with evaluation on 50 real-world documents using human correctness scores (0.97-0.99) and Cohen's kappa agreement (≥0.72). These metrics are computed against independent human annotations and tester feedback rather than any fitted parameters, self-referential predictions, or schema-derived quantities that reduce to the framework's own inputs. The KG schema and EEI loop are design choices justified by domain needs, not derived from the reported results. No equations, self-citation load-bearing uniqueness theorems, or renamings of known results appear in the derivation chain. The central claims remain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework relies on standard LLM capabilities for text understanding and a custom domain schema; no numerical free parameters are introduced and no new physical or mathematical entities are postulated.

axioms (1)
  • domain assumption Large language models guided by a fine-grained schema can reliably extract configuration knowledge from semi-structured technical manuals.
    This premise is invoked to justify the extraction tasks and is supported by the reported correctness scores but remains an empirical assumption rather than a proven property.

pith-pipeline@v0.9.0 · 5795 in / 1256 out tokens · 37369 ms · 2026-05-20T08:28:49.366622+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 2 internal anchors

  1. [1]

    Langchain Official Website

    Accessed: 2025. Langchain Official Website. https://www.langchain.com/

  2. [2]

    Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. 65–72

  3. [3]

    Zhen Bi, Jing Chen, Yinuo Jiang, Feiyu Xiong, Wei Guo, Huajun Chen, and Ningyu Zhang. 2024. Codekgc: Code language model for generative knowledge graph construction.ACM Transactions on Asian and Low-Resource Language Information Processing23, 3 (2024), 1–16

  4. [4]

    Zimin Chen, Yue Pan, Siyu Lu, Jiayi Xu, Claire Le Goues, Martin Monperrus, and He Ye. 2025. Prometheus: Unified knowledge graphs for issue resolution in multilingual codebases.arXiv preprint arXiv:2507.19942(2025)

  5. [5]

    Alberto Rodrigues Da Silva, Ana CR Paiva, and Valter ER Da Silva. 2018. A test specification language for information systems based on data entities, use cases and state machines. InInternational Conference on Model-Driven Engineering and Software Development. Springer, 455–474

  6. [6]

    Jinlan Fu, See Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2024. Gptscore: Evaluate as you desire. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 6556–6576

  7. [7]

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. 2024. A survey on llm-as-a-judge.The Innovation(2024)

  8. [8]

    Liang Guo, Fu Yan, Yuqian Lu, Ming Zhou, and Tao Yang. 2021. An automatic machining process decision-making system based on knowledge graph.International journal of computer integrated manufacturing34, 12 (2021), 1348–1369

  9. [9]

    Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard De Melo, Claudio Gutierrez, Sabrina Kirrane, José Emilio Labra Gayo, Roberto Navigli, Sebastian Neumaier, et al. 2021. Knowledge graphs.ACM Computing Surveys (Csur)54, 4 (2021), 1–37

  10. [10]

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation.ACM computing surveys55, 12 (2023), 1–38

  11. [11]

    Jaehun Jung, Faeze Brahman, and Yejin Choi. 2024. Trust or escalate: Llm judges with provable guarantees for human agreement.arXiv preprint arXiv:2407.18370(2024)

  12. [12]

    Vamsi Krishna Kommineni, Birgitta König-Ries, and Sheeba Samuel. 2024. From human experts to machines: An LLM supported approach to ontology and knowledge graph construction.arXiv preprint arXiv:2403.08345(2024)

  13. [13]

    Fuliang Li, Jiahai Yang, Jianping Wu, Zhiyan Zheng, Huijing Zhang, and Xingwei Wang. 2014. Configuration analysis and recommendation: Case studies in IPv6 networks.Computer Communications53 (2014), 37–51

  14. [14]

    Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. InText summarization branches out. 74–81

  15. [15]

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics12 (2024), 157–173

  16. [16]

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: NLG evaluation using gpt-4 with better human alignment.arXiv preprint arXiv:2303.16634(2023)

  17. [17]

    Satoshi Masuda, Satoshi Kouzawa, Kyousuke Sezai, Hidetoshi Suhara, Yasuaki Hiruta, and Kunihiro Kudou. 2026. Generating high-level test cases from requirements using LLM: An industry study. In2026 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA). IEEE, 1–9

  18. [18]

    Dragan Milchevski, Gordon Frank, Anna Hätty, Bingqing Wang, Xiaowei Zhou, and Zhe Feng. 2025. Multi-Step Generation of Test Specifications using Large Language Models for System-Level Requirements. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track). 132–146

  19. [19]

    OpenAI. [n. d.]. OpenAI API Reference. https://platform.openai.com/docs/api-reference/introduction. Accessed: 2026-02-04

  20. [20]

    OpenAI. 2025. Introducing GPT-5. https://openai.com/index/introducing-gpt-5/. Accessed: 2026-04-23

  21. [21]

    Kazuki Otomo, Satoru Kobayashi, Kensuke Fukuda, Osamu Akashi, Kimihiro Mizutani, and Hiroshi Esaki. 2021. Towards extracting semantics of network config blocks. In2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 1443–1448

  22. [22]

    Jeff Z Pan, Simon Razniewski, Jan-Christoph Kalo, Sneha Singhania, Jiaoyan Chen, Stefan Dietze, Hajira Jabeen, Janna Omeliyanenko, Wen Zhang, Matteo Lissandrini, et al. 2023. Large language models and knowledge graphs: Opportunities and challenges.arXiv preprint arXiv:2308.06374(2023)

  23. [23]

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics. 311–318. Manuscript submitted to ACM 44 Pan et al

  24. [24]

    Ciyuan Peng, Feng Xia, Mehdi Naseriparsa, and Francesco Osborne. 2023. Knowledge graphs: Opportunities and challenges.Artificial intelligence review56, 11 (2023), 13071–13102

  25. [25]

    Ofir Press, Noah A Smith, and Mike Lewis. 2021. Train short, test long: Attention with linear biases enables input length extrapolation.arXiv preprint arXiv:2108.12409(2021)

  26. [26]

    Syed Tahseen Raza Rizvi, Dominique Mercier, Stefan Agne, Steffen Erkel, Andreas Dengel, and Sheraz Ahmed. 2018. Ontology-based Information Extraction from Technical Documents.. InICAART (2). 493–500

  27. [27]

    Monalisa Sarma and Rajib Mall. 2009. Automatic generation of test specifications for coverage of system state transitions.Information and Software Technology51, 2 (2009), 418–432

  28. [28]

    HM Sneed. 1993. Automated tool support for ANSI/IEEE STD: 829-1983 software test documentation. InProceedings 1993 Software Engineering Standards Symposium. IEEE, 308–316

  29. [29]

    Yanqi Su, Zheming Han, Zhenchang Xing, Xin Xia, Xiwei Xu, Liming Zhu, and Qinghua Lu. 2022. Constructing a system knowledge graph of user tasks and failures from bug reports to support soap opera testing. InProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–13

  30. [30]

    Yanqi Su, Dianshu Liao, Zhenchang Xing, Qing Huang, Mulong Xie, Qinghua Lu, and Xiwei Xu. 2024. Enhancing exploratory testing by large language model and knowledge graph. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–12

  31. [31]

    Yanqi Su, Zhenchang Xing, Chong Wang, Chunyang Chen, Sherry Xu, Qinghua Lu, and Liming Zhu. 2025. Automated soap opera testing directed by llms and scenario knowledge: Feasibility, challenges, and road ahead.Proceedings of the ACM on Software Engineering2, FSE (2025), 757–778

  32. [32]

    2010.Practical model-based testing: a tools approach

    Mark Utting and Bruno Legeard. 2010.Practical model-based testing: a tools approach. Elsevier

  33. [33]

    Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. 2023. Is chatgpt a good nlg evaluator? a preliminary study.arXiv preprint arXiv:2303.04048(2023)

  34. [34]

    Zhenzhen Yang, Rubing Huang, Chenhui Cui, Nan Niu, and Dave Towey. 2025. Requirements-based test generation: A comprehensive survey.ACM Transactions on Software Engineering and Methodology(2025)

  35. [35]

    Hongbin Ye, Ningyu Zhang, Hui Chen, and Huajun Chen. 2022. Generative knowledge graph construction: A review. InProceedings of the 2022 conference on empirical methods in natural language processing. 1–17

  36. [36]

    Bowen Zhang and Harold Soh. 2024. Extract, define, canonicalize: An llm-based framework for knowledge graph construction. InProceedings of the 2024 conference on empirical methods in natural language processing. 9820–9836

  37. [37]

    Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2025. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models.Computational Linguistics(2025), 1–46

  38. [38]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al

  39. [39]

    Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems36 (2023), 46595–46623

  40. [40]

    Lianghui Zhu, Xinggang Wang, and Xinlong Wang. 2023. Judgelm: Fine-tuned large language models are scalable judges.arXiv preprint arXiv:2310.17631(2023). Manuscript submitted to ACM