pith. machine review for the scientific record.

arxiv: 2604.09682 · v1 · submitted 2026-04-03 · 💻 cs.NI · cs.AI

Recognition: 1 theorem link

· Lean Theorem

Decision-Theoretic Safety Assessment of Persona-Driven Multi-Agent Systems in O-RAN


Pith reviewed 2026-05-13 18:59 UTC · model grok-4.3

classification 💻 cs.NI cs.AI
keywords multi-agent systems · O-RAN · persona-driven agents · decision-theoretic evaluation · LLM safety · network management · retrieval-augmented generation · agent alignment

The pith

Persona alignment in O-RAN multi-agent systems boosts individual performance by up to 14.3 percent while shaping emergent coordination.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a persona-driven approach for multi-agent systems managing Open Radio Access Networks, where each agent receives a behavioral specification that sets its priorities and decision style. By testing hundreds of these configurations on tasks like energy-efficient resource allocation and load balancing, the work shows that good persona matching improves how well agents perform and how they work together as a group. A decision-theory-based evaluation checks three aspects: whether agents follow optimal strategies, stick to their assigned guidelines, and produce desirable group-level behaviors. The findings indicate that retrieval methods used to customize the agents also play a big role in how much the personas can actually change the system's results. Such validation matters for safely using large language models to automate critical network management without constant human intervention.

Core claim

The paper establishes that, in persona-driven multi-agent systems for O-RAN, alignment between agent personas and roles significantly impacts individual performance (by up to 14.3 percent) and shapes emergent multi-agent coordination. The evaluation across 486 configurations on energy-efficient resource allocation and network load balancing reveals that retrieval architecture choices constrain the effectiveness of persona customization, and that single-agent modifications can propagate through the system, producing detectable incompatibilities in certain combinations.
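The 486-configuration count is consistent with one simple design: three candidate personas for each of the five agents, crossed with the two tasks (3^5 × 2 = 486). The paper does not spell the sweep out, so the persona labels and this factorization are assumptions; a minimal enumeration sketch:

```python
from itertools import product

AGENTS = ["planning", "coordination", "resource_allocation",
          "code_generation", "analysis"]
PERSONAS = ["conservative", "balanced", "aggressive"]  # hypothetical labels
TASKS = ["energy_efficient_allocation", "load_balancing"]

# One configuration = a persona assignment for every agent, on one task.
configs = [
    (task, dict(zip(AGENTS, assignment)))
    for task in TASKS
    for assignment in product(PERSONAS, repeat=len(AGENTS))
]
print(len(configs))  # 3^5 assignments per task x 2 tasks = 486
```

Any per-agent persona pool of size three reproduces the reported count; a different pool size or asymmetric pools would not.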

What carries the argument

A persona-driven multi-agent framework of five specialized agents (planning, coordination, resource allocation, code generation, analysis), each guided by a configurable behavioral persona, evaluated via a three-dimensional decision-theoretic framework measuring normative compliance, prescriptive alignment, and behavioral dynamics.
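The persona specification and the three evaluation dimensions can be sketched in code. The persona fields follow the abstract's description, but the scoring rules below (exact-match optimality, mean guideline satisfaction, a caller-supplied system metric) are illustrative assumptions, not the paper's definitions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Persona:
    # Structured behavioral specification, per the abstract.
    optimization_priority: str   # e.g. "energy" or "throughput"
    risk_tolerance: float        # 0.0 (averse) .. 1.0 (seeking)
    decision_style: str          # e.g. "deliberative" or "reactive"

def evaluate(decisions, optimal, guidelines, system_metric):
    """Score one agent run on the three decision-theoretic dimensions.

    Hypothetical formulas: normative = fraction of decisions matching
    the optimal policy; prescriptive = mean guideline satisfaction;
    behavioral = an emergent system-level metric supplied by the caller.
    """
    normative = sum(d == o for d, o in zip(decisions, optimal)) / len(optimal)
    prescriptive = sum(g(d) for d in decisions for g in guidelines) / (
        len(decisions) * len(guidelines)
    )
    behavioral = system_metric(decisions)
    return {"normative": normative,
            "prescriptive": prescriptive,
            "behavioral": behavioral}
```

This also answers the referee's minor request for pseudocode in spirit: each dimension reduces to one aggregate per run, computed from decisions rather than from the persona text itself.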

If this is right

  • Persona-agent alignment improves individual performance in O-RAN optimization tasks by up to 14.3 percent.
  • Retrieval architecture fundamentally limits how effectively personas can customize agent behavior.
  • Single-agent persona modifications lead to system-wide cascading effects on coordination.
  • Certain persona combinations exhibit fundamental incompatibilities that affect overall system performance.
  • Systematic pre-deployment validation is essential for safe use of LLM-based automation in telecommunications infrastructure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This framework could extend to assessing safety in other multi-agent LLM applications beyond networks, such as in autonomous systems.
  • Testing in actual deployed O-RAN environments rather than simulations would strengthen the evidence for real-world applicability.
  • The observed propagation of changes suggests that agent design processes should incorporate system-level interaction modeling from the outset.
  • Insights on retrieval methods may inform best practices for building customizable multi-agent systems in other domains.

Load-bearing premise

The simulated O-RAN optimization challenges used in testing sufficiently represent the safety and alignment requirements of real-world network deployments.

What would settle it

Demonstrating equivalent performance and coordination outcomes between aligned and misaligned personas in live O-RAN testbeds would show that the measured alignment impacts do not carry over to operational settings.
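"Equivalent outcomes" is the kind of claim an equivalence test is built for, since a non-significant difference alone proves nothing. A minimal two one-sided tests (TOST) sketch under a normal approximation, where the pairing of runs and the equivalence margin are assumptions the testbed study would have to justify:

```python
import math
from statistics import mean, stdev

def tost_paired(aligned, misaligned, margin):
    """Two one-sided tests for equivalence of paired performance scores.

    Returns the larger of the two one-sided p-values (normal
    approximation). A small value supports equivalence within
    +/- margin; the margin itself is a domain judgment call.
    """
    diffs = [a - b for a, b in zip(aligned, misaligned)]
    se = stdev(diffs) / math.sqrt(len(diffs))
    t_lower = (mean(diffs) + margin) / se   # H0: true diff <= -margin
    t_upper = (mean(diffs) - margin) / se   # H0: true diff >= +margin
    phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # normal CDF
    return max(1 - phi(t_lower), phi(t_upper))
```

If both one-sided nulls are rejected, aligned and misaligned personas perform equivalently within the margin, which is exactly the outcome that would undercut the paper's alignment claim.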

Figures

Figures reproduced from arXiv: 2604.09682 by Louis Powell, Mallik Tatipamula, Maryam Hafeez, Syed Ali Raza Zaidi, Vara Prasad Talari, Zeinab Nezami.

Figure 1: Multi-agent system architecture showing orchestration, execution, and validation subsystems with knowledge layer integration.
Figure 2: Normative analysis, agent evaluation across questions. Each subplot displays evaluation scores for different agent personas, organized by agent. Top row shows Q1 results; bottom row shows Q2 results. Scatter points represent individual evaluation runs, color-coded by persona.
Figure 3: Prescriptive evaluation across linguistic habits, persona consistency, and ethical behavior for five agents under Q1 (top) and Q2 (bottom).
Figure 4: Knowledge expansion trajectories across Q1 and Q2.
Figure 5: Code quality progression across three refinement runs.
Figure 6: Semantic embedding distances: (a) planner context divergence.
Original abstract

Autonomous network management in Open Radio Access Networks requires intelligent decision making across conflicting objectives, yet existing LLM-based multi-agent systems employ homogeneous strategies and lack systematic pre-deployment validation. We introduce a persona-driven multi-agent framework in which configurable behavioral personas (structured specifications encoding optimization priorities, risk tolerance, and decision-making style) influence five specialized agents (planning, coordination, resource allocation, code generation, analysis). To enable rigorous validation, we develop a three-dimensional evaluation framework grounded in decision theory, measuring normative compliance (optimality adherence), prescriptive alignment (behavioral guideline consistency), and behavioral dynamics (emergent system properties). We evaluate 486 persona configurations across two O-RAN optimization challenges (energy-efficient resource allocation and network load balancing). Results demonstrate that persona-agent alignment significantly impacts both individual performance (14.3 percent) and emergent multi-agent coordination, with retrieval architecture (GraphRAG vs. RAG) fundamentally constraining customization effectiveness. Single-agent persona modifications propagate system-wide through cascading effects, with certain combinations exhibiting detectable fundamental incompatibilities. Our framework provides systematic validation mechanisms for deploying LLM-based automation in mission-critical telecommunications infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a persona-driven multi-agent framework for autonomous O-RAN management, in which configurable behavioral personas (encoding priorities, risk tolerance, and decision style) modulate five specialized agents. A three-dimensional evaluation framework grounded in decision theory measures normative compliance, prescriptive alignment, and behavioral dynamics. The authors evaluate 486 persona configurations across two simulated optimization problems (energy-efficient resource allocation and load balancing), reporting a 14.3% performance impact from persona alignment, emergent coordination effects, retrieval-architecture constraints (GraphRAG vs. RAG), and cascading incompatibilities.

Significance. If the simulation results prove representative, the work supplies a scalable, decision-theoretic methodology for pre-deployment safety assessment of LLM-based agents in networked systems, with the 486-configuration sweep and explicit treatment of cascading persona effects constituting a concrete empirical contribution to multi-agent alignment research in telecommunications.

major comments (2)
  1. [Abstract and Evaluation] The headline claim that the three-dimensional framework supplies 'systematic validation mechanisms for deploying LLM-based automation in mission-critical telecommunications infrastructure' rests entirely on two abstract simulated optimization problems. No calibration against production O-RAN traces, hardware-in-the-loop measurements, or sensitivity analysis to real-time RIC latency, fronthaul jitter, or non-stationary traffic is reported, leaving the transferability assertion load-bearing yet unsupported.
  2. [Evaluation] The reported 14.3% individual-performance delta and emergent-coordination findings lack error bars, statistical significance tests, and access to the underlying configuration data, preventing independent verification that the observed effects survive changes in simulator parameters.
minor comments (2)
  1. [Framework] The manuscript would benefit from an explicit table mapping the five agent roles to their persona-influenced decision variables.
  2. [Evaluation Framework] Notation for the three evaluation dimensions (normative, prescriptive, behavioral) is introduced without a concise mathematical definition or pseudocode for metric computation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and rigor of our simulation-based study. We address each major point below and will incorporate revisions accordingly.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] The headline claim that the three-dimensional framework supplies 'systematic validation mechanisms for deploying LLM-based automation in mission-critical telecommunications infrastructure' rests entirely on two abstract simulated optimization problems. No calibration against production O-RAN traces, hardware-in-the-loop measurements, or sensitivity analysis to real-time RIC latency, fronthaul jitter, or non-stationary traffic is reported, leaving the transferability assertion load-bearing yet unsupported.

    Authors: We agree that the evaluation relies on two simulated optimization problems and does not include calibration to production O-RAN traces or hardware-in-the-loop tests. The framework is presented as a decision-theoretic methodology for pre-deployment assessment rather than a fully validated production tool. In the revised manuscript we will qualify the abstract and evaluation claims to emphasize the simulation context, add an explicit limitations subsection discussing the absence of real-time RIC latency, fronthaul jitter, and non-stationary traffic sensitivity analyses, and outline planned future work for such calibration. This preserves the contribution while accurately bounding the transferability assertions. revision: yes

  2. Referee: [Evaluation] The reported 14.3% individual-performance delta and emergent-coordination findings lack error bars, statistical significance tests, and access to the underlying configuration data, preventing independent verification that the observed effects survive changes in simulator parameters.

    Authors: We acknowledge the need for statistical rigor and reproducibility. In the revised version we will add error bars to all reported performance metrics, include statistical significance tests (paired t-tests and ANOVA) for the 14.3% delta and coordination effects, and release the full set of 486 configuration parameters together with simulation code and raw results via a public repository. We will also add a brief sensitivity discussion to simulator parameters. These changes directly address the verification concern. revision: yes
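The error bars the authors promise could come from a percentile bootstrap over per-configuration performance deltas. A minimal sketch; organizing the data as one delta per configuration is my assumption about the layout, not something the rebuttal specifies:

```python
import random
from statistics import mean

def bootstrap_ci(deltas, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean delta.

    Resamples the per-configuration deltas with replacement and
    reads the CI off the sorted resampled means.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    means = sorted(
        mean(rng.choices(deltas, k=len(deltas))) for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

A paired t-test or ANOVA, as the rebuttal proposes, would then test whether the interval's distance from zero is statistically meaningful rather than a simulator artifact.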

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper defines a persona-driven multi-agent framework and a three-dimensional evaluation framework (normative compliance, prescriptive alignment, behavioral dynamics) grounded in decision theory, then applies them to 486 configurations on two simulated optimization problems and reports measured performance deltas. No equations, fitted parameters renamed as predictions, or self-citations reduce the reported results (the 14.3% impact, the emergent coordination effects) to the inputs by construction. The evaluation dimensions are stated as independent constructs, and outcomes are obtained from simulation execution rather than tautological re-derivation; the chain is grounded in external simulation benchmarks rather than in its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that behavioral personas can be structured to encode and measurably influence agent priorities and styles in network optimization tasks. No free parameters or invented entities are specified in the abstract.

axioms (1)
  • domain assumption Behavioral personas can be structured as specifications encoding optimization priorities, risk tolerance, and decision-making style that influence agent behavior
    Central premise for the persona-driven framework introduced in the abstract.

pith-pipeline@v0.9.0 · 5515 in / 1256 out tokens · 41948 ms · 2026-05-13T18:59:30.441883+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
