pith. machine review for the scientific record.

arxiv: 2604.09682 · v1 · submitted 2026-04-03 · 💻 cs.NI · cs.AI

Recognition: 1 theorem link

· Lean Theorem

Decision-Theoretic Safety Assessment of Persona-Driven Multi-Agent Systems in O-RAN


Pith reviewed 2026-05-13 18:59 UTC · model grok-4.3

classification 💻 cs.NI cs.AI
keywords multi-agent systems · O-RAN · persona-driven agents · decision-theoretic evaluation · LLM safety · network management · retrieval-augmented generation · agent alignment

The pith

Persona alignment in O-RAN multi-agent systems boosts individual performance by up to 14.3 percent while shaping emergent coordination.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a persona-driven approach for multi-agent systems managing Open Radio Access Networks, where each agent receives a behavioral specification that sets its priorities and decision style. By testing hundreds of these configurations on tasks like energy-efficient resource allocation and load balancing, the work shows that good persona matching improves how well agents perform and how they work together as a group. A decision-theory-based evaluation checks three aspects: whether agents follow optimal strategies, stick to their assigned guidelines, and produce desirable group-level behaviors. The findings indicate that retrieval methods used to customize the agents also play a big role in how much the personas can actually change the system's results. Such validation matters for safely using large language models to automate critical network management without constant human intervention.

Core claim

The paper establishes that, in persona-driven multi-agent systems for O-RAN, alignment between agent personas and roles significantly impacts individual performance (by up to 14.3 percent) and shapes emergent multi-agent coordination. The evaluation across 486 configurations on energy-efficient resource allocation and network load balancing reveals that retrieval architecture choices constrain the effectiveness of persona customization, and that single-agent modifications can propagate through the system, producing detectable incompatibilities in certain combinations.
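The 486-configuration count is consistent with one simple design: three candidate personas for each of the five agents, crossed with the two tasks (3^5 × 2 = 486). The paper does not spell the sweep out, so the persona labels and this factorization are assumptions; a minimal enumeration sketch:

```python
from itertools import product

AGENTS = ["planning", "coordination", "resource_allocation",
          "code_generation", "analysis"]
PERSONAS = ["conservative", "balanced", "aggressive"]  # hypothetical labels
TASKS = ["energy_efficient_allocation", "load_balancing"]

# One configuration = a persona assignment for every agent, on one task.
configs = [
    (task, dict(zip(AGENTS, assignment)))
    for task in TASKS
    for assignment in product(PERSONAS, repeat=len(AGENTS))
]
print(len(configs))  # 3^5 assignments per task x 2 tasks = 486
```

Any per-agent persona pool of size three reproduces the reported count; a different pool size or asymmetric pools would not.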

What carries the argument

A persona-driven multi-agent framework of five specialized agents (planning, coordination, resource allocation, code generation, analysis), each guided by a configurable behavioral persona, evaluated via a three-dimensional decision-theoretic framework measuring normative compliance, prescriptive alignment, and behavioral dynamics.
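The persona specification and the three evaluation dimensions can be sketched in code. The persona fields follow the abstract's description, but the scoring rules below (exact-match optimality, mean guideline satisfaction, a caller-supplied system metric) are illustrative assumptions, not the paper's definitions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Persona:
    # Structured behavioral specification, per the abstract.
    optimization_priority: str   # e.g. "energy" or "throughput"
    risk_tolerance: float        # 0.0 (averse) .. 1.0 (seeking)
    decision_style: str          # e.g. "deliberative" or "reactive"

def evaluate(decisions, optimal, guidelines, system_metric):
    """Score one agent run on the three decision-theoretic dimensions.

    Hypothetical formulas: normative = fraction of decisions matching
    the optimal policy; prescriptive = mean guideline satisfaction;
    behavioral = an emergent system-level metric supplied by the caller.
    """
    normative = sum(d == o for d, o in zip(decisions, optimal)) / len(optimal)
    prescriptive = sum(g(d) for d in decisions for g in guidelines) / (
        len(decisions) * len(guidelines)
    )
    behavioral = system_metric(decisions)
    return {"normative": normative,
            "prescriptive": prescriptive,
            "behavioral": behavioral}
```

This also answers the referee's minor request for pseudocode in spirit: each dimension reduces to one aggregate per run, computed from decisions rather than from the persona text itself.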

If this is right

  • Persona-agent alignment improves individual performance in O-RAN optimization tasks by up to 14.3 percent.
  • Retrieval architecture fundamentally limits how effectively personas can customize agent behavior.
  • Single-agent persona modifications lead to system-wide cascading effects on coordination.
  • Certain persona combinations exhibit fundamental incompatibilities that affect overall system performance.
  • Systematic pre-deployment validation is essential for safe use of LLM-based automation in telecommunications infrastructure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This framework could extend to assessing safety in other multi-agent LLM applications beyond networks, such as in autonomous systems.
  • Testing in actual deployed O-RAN environments rather than simulations would strengthen the evidence for real-world applicability.
  • The observed propagation of changes suggests that agent design processes should incorporate system-level interaction modeling from the outset.
  • Insights on retrieval methods may inform best practices for building customizable multi-agent systems in other domains.

Load-bearing premise

The simulated O-RAN optimization challenges used in testing sufficiently represent the safety and alignment requirements of real-world network deployments.

What would settle it

Demonstrating equivalent performance and coordination outcomes between aligned and misaligned personas in live O-RAN testbeds would show that the measured alignment impacts do not carry over to operational settings.
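"Equivalent outcomes" is the kind of claim an equivalence test is built for, since a non-significant difference alone proves nothing. A minimal two one-sided tests (TOST) sketch under a normal approximation, where the pairing of runs and the equivalence margin are assumptions the testbed study would have to justify:

```python
import math
from statistics import mean, stdev

def tost_paired(aligned, misaligned, margin):
    """Two one-sided tests for equivalence of paired performance scores.

    Returns the larger of the two one-sided p-values (normal
    approximation). A small value supports equivalence within
    +/- margin; the margin itself is a domain judgment call.
    """
    diffs = [a - b for a, b in zip(aligned, misaligned)]
    se = stdev(diffs) / math.sqrt(len(diffs))
    t_lower = (mean(diffs) + margin) / se   # H0: true diff <= -margin
    t_upper = (mean(diffs) - margin) / se   # H0: true diff >= +margin
    phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # normal CDF
    return max(1 - phi(t_lower), phi(t_upper))
```

If both one-sided nulls are rejected, aligned and misaligned personas perform equivalently within the margin, which is exactly the outcome that would undercut the paper's alignment claim.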

Figures

Figures reproduced from arXiv: 2604.09682 by Louis Powell, Mallik Tatipamula, Maryam Hafeez, Syed Ali Raza Zaidi, Vara Prasad Talari, Zeinab Nezami.

Figure 1: Multi-agent system architecture showing orchestration, execution, and validation subsystems with knowledge layer integration.
Figure 2: Normative analysis, agent evaluation across questions. Each subplot displays evaluation scores for different agent personas, organized by agent. Top row shows Q1 results; bottom row shows Q2 results. Scatter points represent individual evaluation runs, color-coded by persona.
Figure 3: Prescriptive evaluation across linguistic habits, persona consistency, and ethical behavior for five agents under Q1 (top) and Q2 (bottom).
Figure 4: Knowledge expansion trajectories across Q1 and Q2.
Figure 5: Code quality progression across three refinement runs.
Figure 6: Semantic embedding distances: (a) planner context divergence.
Original abstract

Autonomous network management in Open Radio Access Networks requires intelligent decision making across conflicting objectives, yet existing LLM-based multi-agent systems employ homogeneous strategies and lack systematic pre-deployment validation. We introduce a persona-driven multi-agent framework in which configurable behavioral personas (structured specifications encoding optimization priorities, risk tolerance, and decision-making style) influence five specialized agents (planning, coordination, resource allocation, code generation, analysis). To enable rigorous validation, we develop a three-dimensional evaluation framework grounded in decision theory, measuring normative compliance (optimality adherence), prescriptive alignment (behavioral guideline consistency), and behavioral dynamics (emergent system properties). We evaluate 486 persona configurations across two O-RAN optimization challenges (energy-efficient resource allocation and network load balancing). Results demonstrate that persona-agent alignment significantly impacts both individual performance (14.3 percent) and emergent multi-agent coordination, with retrieval architecture (GraphRAG vs. RAG) fundamentally constraining customization effectiveness. Single-agent persona modifications propagate system-wide through cascading effects, with certain combinations exhibiting detectable fundamental incompatibilities. Our framework provides systematic validation mechanisms for deploying LLM-based automation in mission-critical telecommunications infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a persona-driven multi-agent framework for autonomous O-RAN management, in which configurable behavioral personas (encoding priorities, risk tolerance, and decision style) modulate five specialized agents. A three-dimensional evaluation framework grounded in decision theory measures normative compliance, prescriptive alignment, and behavioral dynamics. The authors evaluate 486 persona configurations across two simulated optimization problems (energy-efficient resource allocation and load balancing), reporting a 14.3% performance impact from persona alignment, emergent coordination effects, retrieval-architecture constraints (GraphRAG vs. RAG), and cascading incompatibilities.

Significance. If the simulation results prove representative, the work supplies a scalable, decision-theoretic methodology for pre-deployment safety assessment of LLM-based agents in networked systems, with the 486-configuration sweep and explicit treatment of cascading persona effects constituting a concrete empirical contribution to multi-agent alignment research in telecommunications.

major comments (2)
  1. [Abstract and Evaluation] The headline claim that the three-dimensional framework supplies 'systematic validation mechanisms for deploying LLM-based automation in mission-critical telecommunications infrastructure' rests entirely on two abstract simulated optimization problems. No calibration against production O-RAN traces, hardware-in-the-loop measurements, or sensitivity analysis to real-time RIC latency, fronthaul jitter, or non-stationary traffic is reported, leaving the transferability assertion load-bearing yet unsupported.
  2. [Evaluation] The reported 14.3% individual-performance delta and emergent-coordination findings lack error bars, statistical significance tests, and access to the underlying configuration data, preventing independent verification that the observed effects survive changes in simulator parameters.
minor comments (2)
  1. [Framework] The manuscript would benefit from an explicit table mapping the five agent roles to their persona-influenced decision variables.
  2. [Evaluation Framework] Notation for the three evaluation dimensions (normative, prescriptive, behavioral) is introduced without a concise mathematical definition or pseudocode for metric computation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and rigor of our simulation-based study. We address each major point below and will incorporate revisions accordingly.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] The headline claim that the three-dimensional framework supplies 'systematic validation mechanisms for deploying LLM-based automation in mission-critical telecommunications infrastructure' rests entirely on two abstract simulated optimization problems. No calibration against production O-RAN traces, hardware-in-the-loop measurements, or sensitivity analysis to real-time RIC latency, fronthaul jitter, or non-stationary traffic is reported, leaving the transferability assertion load-bearing yet unsupported.

    Authors: We agree that the evaluation relies on two simulated optimization problems and does not include calibration to production O-RAN traces or hardware-in-the-loop tests. The framework is presented as a decision-theoretic methodology for pre-deployment assessment rather than a fully validated production tool. In the revised manuscript we will qualify the abstract and evaluation claims to emphasize the simulation context, add an explicit limitations subsection discussing the absence of real-time RIC latency, fronthaul jitter, and non-stationary traffic sensitivity analyses, and outline planned future work for such calibration. This preserves the contribution while accurately bounding the transferability assertions. revision: yes

  2. Referee: [Evaluation] The reported 14.3% individual-performance delta and emergent-coordination findings lack error bars, statistical significance tests, and access to the underlying configuration data, preventing independent verification that the observed effects survive changes in simulator parameters.

    Authors: We acknowledge the need for statistical rigor and reproducibility. In the revised version we will add error bars to all reported performance metrics, include statistical significance tests (paired t-tests and ANOVA) for the 14.3% delta and coordination effects, and release the full set of 486 configuration parameters together with simulation code and raw results via a public repository. We will also add a brief sensitivity discussion to simulator parameters. These changes directly address the verification concern. revision: yes
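The error bars the authors promise could come from a percentile bootstrap over per-configuration performance deltas. A minimal sketch; organizing the data as one delta per configuration is my assumption about the layout, not something the rebuttal specifies:

```python
import random
from statistics import mean

def bootstrap_ci(deltas, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean delta.

    Resamples the per-configuration deltas with replacement and
    reads the CI off the sorted resampled means.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    means = sorted(
        mean(rng.choices(deltas, k=len(deltas))) for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

A paired t-test or ANOVA, as the rebuttal proposes, would then test whether the interval's distance from zero is statistically meaningful rather than a simulator artifact.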

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper defines a persona-driven multi-agent framework and a three-dimensional evaluation framework (normative compliance, prescriptive alignment, behavioral dynamics) grounded in decision theory, then applies them to 486 configurations on two simulated optimization problems and reports measured performance deltas. No equations, fitted parameters renamed as predictions, or self-citations reduce the reported results (the 14.3% impact, the emergent coordination effects) to the inputs by construction. The evaluation dimensions are stated as independent constructs, and outcomes are obtained from simulation execution rather than tautological re-derivation; the chain is grounded in external simulation benchmarks rather than in its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that behavioral personas can be structured to encode and measurably influence agent priorities and styles in network optimization tasks. No free parameters or invented entities are specified in the abstract.

axioms (1)
  • domain assumption Behavioral personas can be structured as specifications encoding optimization priorities, risk tolerance, and decision-making style that influence agent behavior
    Central premise for the persona-driven framework introduced in the abstract.

pith-pipeline@v0.9.0 · 5515 in / 1256 out tokens · 41948 ms · 2026-05-13T18:59:30.441883+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
