
arxiv: 2605.18025 · v1 · pith:CZ6WGRLFnew · submitted 2026-05-18 · 💻 cs.AI

TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?

Pith reviewed 2026-05-20 11:19 UTC · model grok-4.3

classification 💻 cs.AI
keywords Large Language Models · Telecommunications · Benchmark · Industrial Workflows · Execution Gap · 5G Networks · Network Fault Maintenance · Agent Evaluation

The pith

Large language models reach roughly 90 percent accuracy on telecom language-interface tasks such as intent recognition, but only about 30 percent when they must generate solutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TeleCom-Bench to measure how close current LLMs come to handling real telecommunications work. It builds two layers of tests: one that checks grasp of 3GPP protocols, 5G architecture, and product documentation, and another that runs six tasks taken directly from live network agent logs, including intent recognition, root cause analysis, and solution generation. Results across eight leading models show strong performance on the language-facing steps but a sharp drop when the model must produce executable procedures. A sympathetic reader cares because this gap explains why LLMs remain experimental in telecom control rooms rather than trusted field engineers.

Core claim

Evaluations on TeleCom-Bench reveal a universal Execution Wall: models reach roughly 90 percent accuracy on linguistic interface tasks such as intent recognition and entity extraction yet fall to approximately 30 percent on procedural execution tasks such as solution generation, showing that current LLMs function competently as diagnosticians but fail as field engineers.

What carries the argument

TeleCom-Bench, a benchmark of 12 evaluation sets and 22,678 samples that separates multi-dimensional knowledge comprehension (via knowledge-graph synthesis of protocols and product data) from end-to-end knowledge application on six authentic network-agent tasks.

If this is right

  • Telecom operators can adopt the benchmark to track whether fine-tuned models cross the threshold for safe deployment in fault-maintenance workflows.
  • Developers should allocate alignment effort to procedural reasoning rather than further gains on language-understanding subtasks.
  • The benchmark supplies concrete diagnostics that can steer domain-specific continued pre-training or tool-augmented agent designs.
  • Similar execution gaps are likely to appear in any vertical that pairs documentation with sequential equipment actions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same comprehension-versus-execution split may limit LLM agents in other equipment-heavy sectors such as power-grid operations or semiconductor manufacturing.
  • Extending the benchmark with closed-loop simulators would let researchers measure whether generated solutions actually resolve the reported faults.
  • Hybrid architectures that pair LLMs with rule-based verification modules could close the gap faster than scaling alone.
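
The rule-based verification idea in the last bullet can be sketched as follows. This is a minimal illustration, not the paper's method: the three tool names and the reference step order are borrowed from the standard answer in the paper's Figure 7 example, while the bracket syntax parser, the `verify` function, and the exact-sequence matching rule are assumptions made here for demonstration.

```python
import re

# Tool names borrowed from the paper's Figure 7 example; the real library lists 23 tools.
KNOWN_TOOLS = {"Alarm Recovery Check", "IQ Fragment Cleanup", "Notify Human For Handling"}

# Assumed format: a tool call is written as [Tool Name] or [Tool Name&arguments].
TOOL_PATTERN = re.compile(r"\[([^\[\]&]+?)\s*(?:&[^\]]*)?\]")


def extract_tools(response: str) -> list[str]:
    """Return bracketed tool names in order of appearance, dropping '&...' arguments."""
    return [m.group(1).strip() for m in TOOL_PATTERN.finditer(response)]


def verify(response: str, reference: list[str]) -> dict[str, object]:
    """Flag tools outside the known library and compare the step order to a reference."""
    tools = extract_tools(response)
    unknown = [t for t in tools if t not in KNOWN_TOOLS]
    return {
        "tools": tools,
        "unknown_tools": unknown,
        # Exact sequence equality is a deliberately strict, hypothetical rule.
        "matches_reference": tools == reference,
    }


reference_steps = ["IQ Fragment Cleanup", "Alarm Recovery Check", "Notify Human For Handling"]
# Hypothetical model output that calls a tool not present in the library.
report = verify("[Alarm Recovery Check] [Restart Base Station]", reference_steps)
print(report["unknown_tools"], report["matches_reference"])
```

A check of this kind only catches hallucinated tool names and ordering errors; it says nothing about whether the steps would resolve the fault, which is where the closed-loop simulator suggested above would come in.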

Load-bearing premise

The six core tasks drawn from live network agent workflows are assumed to capture the essential skills required for production-ready telecom agents.

What would settle it

A controlled test in which any of the evaluated models achieves sustained accuracy above 60 percent on solution-generation items while using only the supplied equipment documentation and without external tools would falsify the claimed universal Execution Wall.
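
As a rough sketch of how this criterion could be checked against per-task results: the task grouping follows the abstract and the 60 percent threshold comes from the paragraph above, but the function names and every score below are hypothetical placeholders, not figures from the paper. The sketch also ignores the "sustained" and "without external tools" conditions, which would need a proper experimental protocol.

```python
from statistics import mean

# Task grouping follows the abstract; all scores further down are made-up placeholders.
LINGUISTIC_TASKS = ("intent_recognition", "entity_extraction")
PROCEDURAL_TASKS = ("solution_generation",)
FALSIFICATION_THRESHOLD = 0.60  # from the "What would settle it" criterion


def execution_gap(scores: dict[str, float]) -> float:
    """Mean linguistic accuracy minus mean procedural accuracy for one model."""
    linguistic = mean(scores[t] for t in LINGUISTIC_TASKS)
    procedural = mean(scores[t] for t in PROCEDURAL_TASKS)
    return linguistic - procedural


def wall_is_falsified(all_scores: dict[str, dict[str, float]]) -> bool:
    """True if any model exceeds the threshold on solution generation."""
    return any(
        s["solution_generation"] > FALSIFICATION_THRESHOLD for s in all_scores.values()
    )


example_scores = {
    "model_a": {"intent_recognition": 0.92, "entity_extraction": 0.88, "solution_generation": 0.30},
    "model_b": {"intent_recognition": 0.90, "entity_extraction": 0.90, "solution_generation": 0.35},
}

for name, scores in example_scores.items():
    print(name, round(execution_gap(scores), 2))
print("falsified:", wall_is_falsified(example_scores))
```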

Figures

Figures reproduced from arXiv: 2605.18025 by Chaoyu Zhang, Chen Zhong, Ding Zou, Dongyang Xu, Fang Tan, Huizhen Qiu, Jieting Xiao, Qiaobo Hao, Rui Ma, Xiao Long, Yanqin Gao, Yun Lin, Zhiguo Yang.

Figure 1: Hierarchical Evaluation Structure of TeleCom …
Figure 2: Knowledge Comprehension Pipeline: From Acquisition to Evaluation Generation. This pipeline covers three stages and …
Text extracted alongside Figure 3 (Section 2.2.1, Data Acquisition and Preprocessing): To ensure the benchmark reflects the complexity of real-world networks, we collect raw data from large-scale commercial networks via Network Element Management Systems and Unified Management Environments. The data acquisition utilizes hybrid protocols including MML commands, SFTP file transfer, and SNMP. The dataset encompasses three core dimensions: Performanc…
Figure 3: Knowledge Application Pipeline: From Acquisition to Evaluation Generation. This pipeline illustrates the full workflow …
Figure 4: Performance radar charts of diverse LLMs across Telecom-specific capabilities. The benchmarks are categorized into …
Figure 5: Example formats of three main question types in …
Figure 6: The 23-tool library provided as prompt context. Each tool includes a name and functional specification, enabling …
Figure 7: Model responses to a complex fault resolution task. Generalist models produce unstructured advice or hallucinated commands, failing to utilize the provided tool interface.

Original abstract

While Large Language Models have achieved remarkable integration in various vertical scenarios, their deployment in the telecommunications domain remains exploratory due to the lack of a standardized evaluation framework. Current telecom benchmarks primarily focus on static, foundational knowledge and isolated atomic skills, neglecting the equipment-specific documentation and end-to-end industrial workflows essential for real-world production systems. To bridge this gap, we present TeleCom-Bench, a comprehensive benchmark comprising 12 evaluation sets with 22,678 curated samples, which evaluates LLMs across a synergistic hierarchy: (1) Multi-dimensional Knowledge Comprehension, which integrates telecommunication fundamentals, 3GPP protocols, and 5G network architecture with proprietary product knowledge across wired, core, and wireless networks via knowledge graph-driven synthesis; and (2) End-to-End Knowledge Application, which formalizes six core tasks on authentic trajectories from live network agent workflows, including intent recognition, entity extraction, event verification, tool invocation, root cause analysis, and solution generation, across network optimization and fault maintenance scenarios. Evaluations of eight state-of-the-art LLMs reveal a universal Execution Wall: while models achieve 90% accuracy in linguistic interface tasks such as intent recognition and entity extraction, performance collapses to approximately 30% in procedural execution tasks like solution generation. This capability gap demonstrates that current LLMs function competently as diagnosticians but fail as field engineers. TeleCom-Bench provides standardized diagnostics to precisely pinpoint this deficit, offering actionable guidance for domain-specific alignment toward production-ready telecom agents. The dataset and evaluation code have been released at https://github.com/ZTE-AICloud/TeleCom-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TeleCom-Bench, a benchmark with 12 evaluation sets and 22,678 curated samples that assesses LLMs on a hierarchy of multi-dimensional knowledge comprehension (telecom fundamentals, 3GPP protocols, 5G architecture, and proprietary product knowledge) and end-to-end knowledge application. The latter formalizes six core tasks—intent recognition, entity extraction, event verification, tool invocation, root cause analysis, and solution generation—drawn from authentic live network agent workflows in optimization and fault maintenance. Evaluations of eight state-of-the-art LLMs show ~90% accuracy on linguistic interface tasks but collapse to ~30% on procedural execution tasks such as solution generation, leading to the claim of a universal 'Execution Wall' where LLMs function as competent diagnosticians but fail as field engineers. The dataset and code are released publicly.

Significance. If the benchmark tasks are representative of industrial requirements, the reported performance gap supplies concrete, standardized diagnostics for LLM limitations in telecom and offers actionable guidance for domain-specific alignment toward production-ready agents. The public release of the dataset and evaluation code is a clear strength that supports reproducibility and community follow-up work.

major comments (2)
  1. [Abstract / End-to-End Knowledge Application description] End-to-End Knowledge Application (six core tasks): The central 'Execution Wall' interpretation—that the 90%-to-30% gap demonstrates LLMs 'fail as field engineers'—rests on the assumption that the six tasks drawn from live trajectories capture the essential capabilities for production-ready telecom agents. The manuscript provides no reported expert validation, coverage analysis against full job requirements (e.g., novel equipment states, real-time uncertainty, or physical-layer coordination), or comparison to broader industrial workflows; this is load-bearing for generalizing the observed collapse beyond the specific benchmark.
  2. [Abstract] Abstract (sample curation): The concrete accuracy numbers and the 22,678-sample count are presented without details on curation criteria, inter-annotator agreement, or exclusion rules for the trajectories. This weakens independent verification of the exact 90-to-30 gap and its attribution to model capability rather than benchmark construction choices.
minor comments (2)
  1. Consider adding a summary table that reports per-LLM, per-task accuracies (including confidence intervals or variance across runs) to make the 'universal' gap claim easier to inspect at a glance.
  2. Clarify the exact definition and scoring rubric for 'solution generation' and 'root cause analysis' tasks, as these appear to be the primary drivers of the reported performance drop.
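
The summary table requested in the first minor comment could be produced along these lines. This is a minimal sketch assuming per-task counts of correct and total items are available; the Wilson score interval is one common choice for a binomial proportion and is not something the paper is known to use, and the model name and counts below are hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def format_row(model: str, task: str, correct: int, total: int) -> str:
    """One table row: model, task, accuracy, and its confidence interval."""
    low, high = wilson_interval(correct, total)
    return f"{model:<10} {task:<20} {correct / total:6.1%}  [{low:.1%}, {high:.1%}]"


# Hypothetical counts, for illustration only.
rows = [
    ("model_a", "intent_recognition", 450, 500),
    ("model_a", "solution_generation", 150, 500),
]
for row in rows:
    print(format_row(*row))
```

Reporting intervals per cell makes it easier to judge whether the gap between task groups exceeds sampling noise for every model, which is what a "universal" claim requires.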

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating where revisions will be made to improve clarity and transparency.

Point-by-point responses
  1. Referee: [Abstract / End-to-End Knowledge Application description] End-to-End Knowledge Application (six core tasks): The central 'Execution Wall' interpretation—that the 90%-to-30% gap demonstrates LLMs 'fail as field engineers'—rests on the assumption that the six tasks drawn from live trajectories capture the essential capabilities for production-ready telecom agents. The manuscript provides no reported expert validation, coverage analysis against full job requirements (e.g., novel equipment states, real-time uncertainty, or physical-layer coordination), or comparison to broader industrial workflows; this is load-bearing for generalizing the observed collapse beyond the specific benchmark.

    Authors: We agree that the manuscript does not report a formal expert validation study or quantitative coverage analysis against the full spectrum of job requirements such as novel equipment states or physical-layer coordination. The six tasks were formalized directly from authentic live network agent workflows in optimization and fault maintenance, as described in the paper. To strengthen the grounding of the 'Execution Wall' claim, we will revise the manuscript to include additional details on the workflow analysis process and how these tasks map to core industrial procedures. This will better contextualize the scope of the observed performance gap without overgeneralizing beyond the benchmark. revision: yes

  2. Referee: [Abstract] Abstract (sample curation): The concrete accuracy numbers and the 22,678-sample count are presented without details on curation criteria, inter-annotator agreement, or exclusion rules for the trajectories. This weakens independent verification of the exact 90-to-30 gap and its attribution to model capability rather than benchmark construction choices.

    Authors: We concur that explicit details on curation criteria, inter-annotator agreement, and exclusion rules are necessary for independent verification. The current manuscript provides only a high-level description of the 22,678 curated samples. In the revised version, we will expand the relevant sections to document the curation process, including agreement metrics and trajectory filtering rules, thereby allowing readers to more rigorously assess the benchmark construction and the reported performance differences. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or fitted predictions

Full rationale

The paper constructs TeleCom-Bench as a new test set from live network trajectories and measures LLM accuracy on six tasks; the central claim of an Execution Wall is a direct empirical observation on external models rather than any derivation, equation, or parameter fit that reduces to the authors' inputs. No self-citations, ansatzes, or uniqueness theorems are invoked to justify the results. The benchmark is self-contained against external models and the newly released dataset.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The benchmark rests on the domain assumption that the selected samples and task trajectories are representative of real production systems; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption The 22,678 curated samples and six core tasks drawn from live network agent workflows accurately represent essential industrial skills.
    This premise underpins the claim that the observed Execution Wall reflects a genuine capability gap rather than an artifact of test construction.

pith-pipeline@v0.9.0 · 5858 in / 1278 out tokens · 37959 ms · 2026-05-20T11:19:34.996916+00:00 · methodology


