Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review
Pith reviewed 2026-05-10 17:07 UTC · model grok-4.3
The pith
Fairness research on multi-agent systems for software engineering remains too fragmented and limited to support reliably fair systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Screening 350 papers down to 18 relevant studies reveals that fairness in LLM-enabled multi-agent systems is framed as a combination of trustworthy AI principles, bias reduction across groups, and interactional dynamics in collectives, with evaluation relying on accuracy metrics, demographic disparity measures, and MAS-specific notions such as conformity and bias amplification. Reported harms span representational, quality-of-service, security and privacy, and governance failures, yet the field shows fragmented evaluation practices, limited generalization from simplified environments, and scarce mitigation mechanisms aligned to actual software workflows.
What carries the argument
The rapid review's synthesis of three gaps—fragmented evaluation that blocks comparison, limited generalization from narrow setups, and underdeveloped mitigation and governance tied to real software processes—drawn from the 18 analyzed studies.
If this is right
- MAS-aware benchmarks would allow direct comparison of fairness results across different agent systems and settings.
- Standardized evaluation protocols would replace the current mix of accuracy checks and disparity measures; a minimal disparity-measure sketch follows this list.
- Governance approaches that run through all stages of software creation would address the current scarcity of practical fixes.
- Specific attention to harms such as bias amplification in agent groups and privacy failures would become part of future software engineering work.
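A minimal sketch of what one such shared disparity measure could look like, assuming binary favorable/unfavorable outcomes and a demographic attribute attached to each evaluation case; the group labels, outcome encoding, and sample data below are illustrative assumptions, not drawn from the reviewed studies.

```python
# Hypothetical sketch: a demographic-parity-style gap over outcomes produced
# by a multi-agent pipeline. Group labels and data are illustrative only.
from collections import defaultdict


def demographic_parity_gap(records):
    """records: iterable of (group, favorable) pairs with favorable in {0, 1}.
    Returns (max - min favorable-outcome rate across groups, per-group rates)."""
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for group, outcome in records:
        totals[group] += 1
        favorable[group] += int(outcome)
    rates = {g: favorable[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates


# Outcomes of a hypothetical multi-agent code-review run, keyed by a
# demographic attribute attached to each evaluation case.
records = [("group_a", 1), ("group_a", 1), ("group_a", 0),
           ("group_b", 1), ("group_b", 0), ("group_b", 0)]
gap, per_group = demographic_parity_gap(records)
print(per_group, gap)  # group_a ~0.67, group_b ~0.33, gap ~0.33
```

Computed the same way across systems and settings, a gap like this would allow the direct comparisons that fragmented, system-specific metrics currently block.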
Where Pith is reading between the lines
- Without progress on these gaps, teams adopting multi-agent systems for code tasks may embed undetected biases into production software.
- Early fairness checks built into agent-based tools could limit the spread of quality-of-service harms to later development stages.
- Closer ties between fairness evaluation methods and standard software testing practices would make mitigation easier to apply in daily workflows; a test-style sketch follows this list.
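A sketch of how such a check might sit alongside ordinary unit tests, assuming a pytest-style runner; the stubbed pipeline, evaluation cases, and 0.1 threshold are assumptions for illustration, not practices reported in the reviewed studies.

```python
# Hypothetical sketch: a fairness disparity check expressed as an ordinary
# unit test, so a regression fails CI like any functional bug.
MAX_ALLOWED_GAP = 0.1  # illustrative policy threshold, not from the paper


def run_agent_pipeline(case):
    # Stand-in for the multi-agent system under test; a real harness would
    # invoke the agents and map their output to a favorable/unfavorable label.
    return case["favorable"]


def test_outcomes_are_demographically_balanced():
    cases = [
        {"group": "group_a", "favorable": 1},
        {"group": "group_a", "favorable": 0},
        {"group": "group_b", "favorable": 1},
        {"group": "group_b", "favorable": 0},
    ]
    rates = {}
    for group in {c["group"] for c in cases}:
        outcomes = [run_agent_pipeline(c) for c in cases if c["group"] == group]
        rates[group] = sum(outcomes) / len(outcomes)
    gap = max(rates.values()) - min(rates.values())
    assert gap <= MAX_ALLOWED_GAP, f"disparity gap {gap:.2f} exceeds policy"
```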
Load-bearing premise
The screening of 350 papers down to 18 and the qualitative reading of their content accurately reflect the main patterns and shortfalls in current fairness work on multi-agent systems for software engineering.
What would settle it
A set of studies that apply the same fairness measures and tested fixes across multiple real software development stages in multi-agent systems would challenge the finding that the research cannot yet support deployable fair systems.
Original abstract
Transformer-based large language models (LLMs) and multi-agent systems (MAS) are increasingly embedded across the software development lifecycle (SDLC), yet their fairness implications for developer-facing tools remain underexplored despite their growing role in shaping what code is written, reviewed, and released. We present a rapid review of recent work on fairness in MAS, emphasizing LLM-enabled settings and relevance to software engineering. Starting from an initial set of 350 papers, we screened and filtered the corpus for relevance, retaining 18 studies for final analysis. Across these 18 studies, fairness is framed as a combination of trustworthy AI principles, bias reduction across groups, and interactional dynamics in collectives, while evaluation spans accuracy metrics on bias benchmarks, demographic disparity measures, and emergent MAS-specific notions such as conformity and bias amplification. Reported harms include representational, quality-of-service, security and privacy, and governance failures, which we relate to SDLC stages where evidence is most and least developed. We identify three persistent gaps: (1) fragmented, rarely MAS-specific evaluation practices that limit comparability, (2) limited generalization due to simplified environments and narrow attribute coverage, and (3) scarce, weakly evaluated mitigation and governance mechanisms aligned to real software workflows. These findings suggest MAS fairness research is not yet ready to support deployable, fairness-assured software systems, motivating MAS-aware benchmarks, consistent protocols, and lifecycle-spanning governance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a rapid review of fairness in multi-agent systems (MAS), with emphasis on LLM-enabled agents and relevance to the software development lifecycle (SDLC). From an initial corpus of 350 papers, 18 studies are retained after screening. The review synthesizes fairness framings (trustworthy AI principles, group bias reduction, and collective interaction dynamics), evaluation approaches (accuracy on bias benchmarks, demographic disparity metrics, and MAS-specific notions such as conformity and bias amplification), reported harms (representational, quality-of-service, security/privacy, and governance failures), and their mapping to SDLC stages. Three gaps are identified: fragmented and rarely MAS-specific evaluation practices, limited generalization from simplified environments and narrow attribute coverage, and scarce, weakly evaluated mitigation and governance mechanisms. The central claim is that MAS fairness research is not yet ready to support deployable, fairness-assured software systems, motivating MAS-aware benchmarks, consistent protocols, and lifecycle-spanning governance.
Significance. If the identified gaps prove representative, the work is significant as a timely structured synthesis at the intersection of AI fairness, multi-agent systems, and software engineering. It usefully relates harms to specific SDLC stages and articulates concrete, actionable directions for future research. The review's contribution is strengthened by its focus on LLM-enabled MAS, an area of growing practical importance, though its overall impact hinges on the transparency and completeness of the underlying literature selection.
Major comments (1)
- [Methods / rapid review description] The rapid review methodology (abstract and methods section): the process that reduces 350 papers to 18 is stated at a high level but provides no details on search strings, databases, date ranges, inclusion/exclusion criteria, or inter-rater agreement metrics. This is load-bearing for the central claim because the three gaps and the conclusion that the field is 'not yet ready' rest on the assumption that the selected studies accurately represent the state of LLM-enabled MAS fairness work; without a documented protocol it is impossible to rule out systematic under-sampling of mitigation or benchmark papers.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our rapid review. We address the single major comment below and will revise the manuscript to improve methodological transparency.
Point-by-point responses
Referee: [Methods / rapid review description] The rapid review methodology (abstract and methods section): the process that reduces 350 papers to 18 is stated at a high level but provides no details on search strings, databases, date ranges, inclusion/exclusion criteria, or inter-rater agreement metrics. This is load-bearing for the central claim because the three gaps and the conclusion that the field is 'not yet ready' rest on the assumption that the selected studies accurately represent the state of LLM-enabled MAS fairness work; without a documented protocol it is impossible to rule out systematic under-sampling of mitigation or benchmark papers.
Authors: We agree that the current manuscript describes the reduction from 350 papers to 18 at a high level without sufficient protocol details. As this is a rapid review, the main text was kept concise, but we recognize that this limits assessment of representativeness and supports the referee's point that it is load-bearing for our conclusions. In the revised version we will expand the Methods section with a dedicated protocol subsection that specifies the search strings, databases queried, date ranges, full inclusion/exclusion criteria, and any inter-rater agreement statistics. We will also add a PRISMA-style flow diagram and make the complete search protocol available as supplementary material. These changes will directly address the concern and allow readers to evaluate potential sampling biases.
Revision: yes
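For context on the inter-rater agreement statistics the authors commit to reporting, a common choice in screening protocols is Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e the agreement expected by chance. The sketch below assumes two hypothetical screeners making binary include/exclude decisions; the label vectors are made up for illustration and are not drawn from the paper's protocol.

```python
# Hypothetical sketch: Cohen's kappa for two screeners making binary
# include/exclude decisions, as might be reported for a rapid-review protocol.

def cohens_kappa(rater_a, rater_b):
    """kappa = (p_o - p_e) / (1 - p_e) for two equal-length label sequences."""
    assert rater_a and len(rater_a) == len(rater_b)
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_expected = sum(
        (rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels
    )
    return (p_observed - p_expected) / (1 - p_expected)


# Two hypothetical screeners labelling 10 candidate papers (1 = include).
screener_1 = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
screener_2 = [1, 1, 0, 0, 0, 0, 0, 1, 0, 1]
print(round(cohens_kappa(screener_1, screener_2), 2))  # ~0.58
```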
Circularity Check
Descriptive literature review exhibits no circularity
Full rationale
This rapid review paper contains no mathematical derivations, equations, predictions, fitted parameters, or ansatzes. Its central synthesis of gaps (fragmented evaluation, limited generalization, scarce mitigation) is derived from qualitative analysis of 18 externally sourced studies screened from an initial corpus of 350 papers. No self-citation chains, self-definitional loops, or renaming of known results are present in the provided text or abstract; the methodology and conclusions remain independent of the paper's own inputs.