pith. sign in

arxiv: 2606.30906 · v1 · pith:UNRO2FZ2new · submitted 2026-06-29 · 💻 cs.AI

Investigating Multi-Agent Deliberation in Law

Pith reviewed 2026-07-01 01:40 UTC · model grok-4.3

classification 💻 cs.AI
keywords multi-agent deliberationlegal reasoninglarge language modelsAI agentscourtroom procedureslegal argumentationbenchmark evaluation
0
0 comments X

The pith

Multi-agent deliberation frameworks match single LLMs on legal tasks but generate distinct answers that solve different cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests multi-agent deliberation methods for legal reasoning tasks with large language models, including two new frameworks modeled on courtroom procedures and legal argumentation. On both legal and non-legal benchmarks the multi-agent setups reach overall accuracy levels similar to standard single-model baselines. At the same time they produce markedly different answers, succeeding on some problems the baselines miss and failing on others the baselines handle. The distinction stands out most clearly on questions that call for weighing multiple perspectives. Readers would care because the results indicate multi-agent systems could serve as a practical complement to existing AI tools in law rather than a direct replacement.

Core claim

Multi-agent deliberation methods for legal reasoning using LLMs achieve comparable performance to baseline large language models but produce significantly distinct answers. These approaches can successfully solve cases that the baseline fails to address and vice versa. Qualitative evaluation shows multi-agent frameworks appear better suited for answering questions that require critical thinking from multiple perspectives.

What carries the argument

Multi-agent deliberation (MAD) frameworks, including two novel ones inspired by courtroom procedures and legal argumentation, that generate and compare multiple reasoning paths before reaching an answer.

If this is right

  • Multi-agent systems can address legal cases that single large language models cannot.
  • They show advantages on tasks that require critical thinking from multiple perspectives.
  • Law-inspired multi-agent designs offer a viable direction for AI tools in the legal domain.
  • The distinct answer sets suggest multi-agent methods can complement rather than duplicate monolithic approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pairing outputs from multi-agent and single-model systems could increase the total number of cases solved without raising overall error rates.
  • The observed answer differences may point to underlying limits in how single models explore alternative legal interpretations.
  • The same deliberation pattern could be tested in other high-stakes domains that value multiple viewpoints, such as medical diagnosis or policy analysis.

Load-bearing premise

The performance differences and qualitative advantages seen in benchmarks will continue to appear under real-world legal conditions with varied case complexity and adversarial inputs.

What would settle it

A side-by-side test of the multi-agent frameworks and baseline models on a fresh collection of actual court cases that includes adversarial arguments and domain-specific constraints, checking whether the complementary success rates and critical-thinking edge persist.

Figures

Figures reproduced from arXiv: 2606.30906 by Cor Steging, Ludi van Leeuwen, Tadeusz Zbiegie\'n.

Figure 1
Figure 1. Figure 1: The pipelines of each of the frameworks we evaluate. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Example question 869 from the Privacy dataset. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Performance comparison of the multi-agent frameworks against baseline across the five benchmark datasets. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Number of questions in which the framework dis [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Answers to example question (Figure 2) [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompts used in the Multi-Agend Deliberation (MAD) system. [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Prompts used in the 3-Ply system [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Prompts used in the Parrots system [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
read the original abstract

Artificial Intelligence is increasingly applied to the field of law, and has the potential to increase access to justice. One particular movement that is gaining traction is that of agentic AI, wherein AI agents, based on Large Language Models (LLMs) can take autonomous actions. In particular, multi-agent approaches in the legal domain remain largely unexplored. In this paper, we investigate multi-agent deliberation methods for legal reasoning tasks using LLMs. We explore multi-agent deliberation (MAD) and introduce two novel multi-agent frameworks inspired by courtroom procedures and legal argumentation. Our experiments on both legal and non-legal benchmarks reveal that multi-agent frameworks achieve comparable overall performance to baseline large language models, but produce significantly distinct answers. Notably, these approaches can successfully solve cases that the baseline fails to address, and vice versa. We conduct a qualitative evaluation and highlight scenarios where multi-agent frameworks outperform monolithic approaches. For example, multi-agent approaches appear better suited for answering questions that require critical thinking from multiple perspectives. Our work positions multi-agent systems as a promising direction for AI in the legal domain, while demonstrating the potential of law-inspired multi-agent approaches for deliberation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript investigates multi-agent deliberation (MAD) for legal reasoning tasks with LLMs. It introduces MAD along with two novel frameworks modeled on courtroom procedures and legal argumentation. Experiments on legal and non-legal benchmarks indicate that the multi-agent approaches achieve comparable overall performance to baseline LLMs while producing significantly distinct answers, with each solving some cases the other misses. A qualitative evaluation is presented to highlight advantages in critical-thinking scenarios requiring multiple perspectives.

Significance. If the reported empirical patterns hold, the work usefully demonstrates complementarity between single-LLM and multi-agent deliberation on the tested benchmarks, which could inform ensemble strategies in legal AI. The law-inspired framework designs add a domain-motivated angle to agentic systems. The manuscript does not claim real-world deployment or generalization beyond the benchmarks, so the scope remains appropriately bounded.

major comments (1)
  1. [Abstract] Abstract: the central claim that multi-agent frameworks 'achieve comparable overall performance' and 'produce significantly distinct answers' is stated without any numerical accuracy values, benchmark identifiers, definitions of 'distinct,' sample sizes, or statistical comparisons. Because this empirical observation is the load-bearing result, the results section (or a new table) must supply the concrete metrics, error bars, and exact task definitions that allow readers to assess whether the comparability and complementarity claims are supported.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that multi-agent frameworks 'achieve comparable overall performance' and 'produce significantly distinct answers' is stated without any numerical accuracy values, benchmark identifiers, definitions of 'distinct,' sample sizes, or statistical comparisons. Because this empirical observation is the load-bearing result, the results section (or a new table) must supply the concrete metrics, error bars, and exact task definitions that allow readers to assess whether the comparability and complementarity claims are supported.

    Authors: We agree that the abstract would be strengthened by including concrete metrics and that the results presentation should make the supporting numbers fully transparent. The manuscript reports accuracy values, benchmark names (legal and non-legal), and the definition of distinct answers (cases solved by one approach but missed by the other) in the results section, along with sample sizes. To address the concern directly, we will revise the abstract to include key numerical accuracy figures and benchmark identifiers, and we will add or expand a summary table in the results section that includes error bars (from repeated runs where applicable) and any statistical comparisons. This change will be made in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical comparison

full rationale

The paper reports benchmark experiments comparing multi-agent deliberation frameworks against baseline LLMs on legal and non-legal tasks. The central claim (comparable aggregate accuracy with distinct answer sets, each solving cases the other misses) is an observed empirical outcome, not a derivation, prediction, or fitted quantity that reduces to its own inputs. No equations, uniqueness theorems, ansatzes, or self-citations are invoked as load-bearing steps in any derivation chain. The work is self-contained against external benchmarks and does not rename known results or smuggle assumptions via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review prevents full enumeration; typical LLM assumptions such as effective role prompting and deliberation convergence are implicit but unstated.

axioms (1)
  • domain assumption LLM agents can be effectively prompted to perform distinct roles in deliberation
    Foundational to the multi-agent setup described.

pith-pipeline@v0.9.1-grok · 5726 in / 1077 out tokens · 33697 ms · 2026-07-01T01:40:04.564098+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 20 canonical work pages · 1 internal anchor

  1. [1]

    Abdelrahman Abdallah, Bhawna Piryani, and Adam Jatowt. 2023. Exploring the state of the art in legal QA systems.Journal of Big Data10, 1 (Aug. 2023). doi:10.1186/s40537-023-00802-8

  2. [2]

    D. B. Acharya, K. Kuppan, and B. Divya. 2025. Agentic AI: Autonomous in- telligence for complex goals—A comprehensive survey.IEEe Access13 (2025), 18912–18936

  3. [3]

    G. Amati. 2009.BM25. Springer US, Boston, MA, 257–260

  4. [4]

    Arisaka, J

    R. Arisaka, J. Dauphin, K. Satoh, and L. van der Torre. 2022. Multi-agent Argu- mentation and Dialogue.IfCoLog Journal of Logics and Their Applications9, 4 (2022), 853–886

  5. [5]

    1988.Modelling Legal Argument: Reasoning with Cases and Hypo- theticals

    Kevin Ashley. 1988.Modelling Legal Argument: Reasoning with Cases and Hypo- theticals. PhD Thesis. University of Massachusetts, Amherst

  6. [6]

    Sourav Banerjee, Ayushi Agarwal, and Saloni Singla. 2025. LLMs Will Always Hallucinate, and We Need to Live with This. InIntelligent Systems and Applications, Kohei Arai (Ed.). Springer Nature Switzerland, Cham, 624–648

  7. [7]

    Trevor Bench-Capon, Katie Atkinson, Floris Bex, Henry Prakken, and Bart Verheij

  8. [8]

    Computational Models of Legal Argument.IfCoLoG Journal of Logics and their Applications12(3) (2025), 323–425

  9. [9]

    Trevor Bench-Capon and Giovanni Sartor. 2003. A model of legal reasoning with cases incorporating theories and values.Artificial Intelligence150, 1-2 (Nov. 2003), 97–143. doi:10.1016/S0004-3702(03)00108-5

  10. [10]

    Trevor Bench-Capon and Marek Sergot. 1988. Towards a Rule-Based Repre- sentation of Open Texture in Law.Computer Power and Legal Language(1988), 81–88

  11. [11]

    Strandburg

    Sebastian Benthall and Katherine J. Strandburg. 2021. Agent-Based Modeling as a Legal Theory Tool.Frontiers in Physics9 (2021). doi:10.3389/fphy.2021.666386

  12. [12]

    Prathamesh Devadiga, Omkaar Jayadev Shetty, and Pooja Agarwal. 2025. SAM- VAD: A Multi-Agent System for Simulating Judicial Deliberation Dynamics in India. arXiv:2509.03793 [cs.MA] https://arxiv.org/abs/2509.03793

  13. [13]

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch

  14. [14]

    Improving Factuality and Reasoning in Language Models through Multiagent Debate

    Improving factuality and reasoning in language models through multiagent debate.arXiv preprint arXiv:2305.14325(2023). AIDA2J, June 08, 2026, Singapore Steging et al

  15. [15]

    Ellsworth

    P.C. Ellsworth. 2005.Legal reasoning. Cambridge University Press, New York, 685–704

  16. [16]

    Yu Fan, Jingwei Ni, Jakob Merane, Etienne Salimbeni, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, and Joel Niklaus. 2025. LEXam: Benchmark- ing Legal Reasoning on 340 Law Exams. arXiv:2505.12864 [cs.CL] https...

  17. [17]

    Eveline T. Feteris. 1997. A Survey of 25 Years of Research on Legal Argumentation. Argumentation11, 3 (1997), 355–376. doi:10.1023/A:1007794830151

  18. [18]

    Randy Goebel, Yoshinobu Kano, Mi-Young Kim, Juliano Rabelo, Ken Satoh, and Masaharu Yoshioka. 2024. Overview of benchmark datasets and methods for the legal information extraction/entailment competition (COLIEE) 2024. InJSAI International Symposium on Artificial Intelligence. Springer, 109–124

  19. [19]

    Gordon, and Michał Araszkiewicz

    Guido Governatori, Trevor Bench-Capon, Bart Verheij, Giovanni Sartor, Adam Wyner, Matthias Grabmair, Katie Atkinson, Henry Prakken, Floris Bex, Thomas F. Gordon, and Michał Araszkiewicz. 2022. Thirty years of Artificial Intelligence and Law: the first decade.Artificial Intelligence and Law30, 4 (2022), 481–519

  20. [20]

    Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N

    Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rock- more, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M. Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan H. C...

  21. [21]

    Hart and L Green

    H.L.A. Hart and L Green. 2012.The concept of law(3rd edition ed.). Oxford University Press, Oxford

  22. [22]

    Zhitao He, Pengfei Cao, Chenhao Wang, Zhuoran Jin, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, and Jun Zhao. 2024. AgentsCourt: Building Judicial Decision-Making Agents with Court Debate Simulation and Legal Knowl- edge Augmentation. arXiv:2403.02959 [cs.CL] https://arxiv.org/abs/2403.02959

  23. [23]

    Nils Holzenberger and Benjamin Van Durme. 2021. Factoring statutory reasoning as language understanding challenges.arXiv preprint arXiv:2105.07903(2021)

  24. [24]

    Cong Jiang and Xiaolei Yang. 2024. Agents on the Bench: Large Lan- guage Model Based Multi Agent Framework for Trustworthy Digital Justice. arXiv:2412.18697 [cs.AI] https://arxiv.org/abs/2412.18697

  25. [25]

    Jiang and X

    C. Jiang and X. Yang. 2025. AgentsBench: A Multi-Agent LLM Simulation Frame- work for Legal Judgment Prediction.Systems13, 8 (2025), 641. doi:10.3390/ systems13080641

  26. [26]

    Jinqi Lai, Wensheng Gan, Jiayang Wu, Zhenlian Qi, and Philip S. Yu. 2024. Large language models in law: A survey.AI Open5 (2024), 181–196. doi:10.1016/j. aiopen.2024.09.002

  27. [27]

    Hanmeng Liu, Jian Liu, Leyang Cui, Zhiyang Teng, Nan Duan, Ming Zhou, and Yue Zhang. 2023. LogiQA 2.0—An Improved Dataset for Logical Reasoning in Natural Language Understanding.IEEE/ACM Transactions on Audio, Speech, and Language Processing31 (2023), 2947–2962. doi:10.1109/TASLP.2023.3293046

  28. [28]

    Jorge Martinez-Gil. 2023. A survey on legal question–answering systems.Com- puter Science Review48 (2023), 100552. doi:10.1016/j.cosrev.2023.100552

  29. [29]

    Nicolas Maudet, Simon Parsons, and Iyad Rahwan. 2007. Argumentation in Multi-Agent Systems: Context and Recent Developments. InArgumentation in Multi-Agent Systems: Third International Workshop, ArgMAS 2006, Hakodate, Japan, May 8, 2006. Revised Selected and Invited Papers, Nicolas Maudet, Simon Parsons, and Iyad Rahwan (Eds.). Lecture Notes in Computer S...

  30. [30]

    Izquierdo, and Giovanni Sartor

    Eunate Mayor, Luis R. Izquierdo, and Giovanni Sartor. 2009. Nice and Nasty Lawyers, is the Legal System to Blame? Agent-based simulation insights. In Proceedings of the 2009 Conference on Legal Knowledge and Information Systems: JURIX 2009: The Twenty-Second Annual Conference. IOS Press, NLD, 126–135

  31. [31]

    E. Musi, N. Kokciyan, K. Al-Khatib, D. Ceolin, E. Dietz, K. Gutekunst, A. Hautli- Janisz, C.M. Santibañez Yañez, J. Schneider, J. Scholz, C. Steging, J. Visser, and H. Wachsmuth. 2025. Toward Reasonable Parrots: Why Large Language Models Should Argue with Us by Design. InThe 12th Workshop on Argument Mining, co-located with ACL 2025. Vienna, Austria

  32. [32]

    Henry Prakken and Giovanni Sartor. 2015. Law and logic: A review from an argumentation perspective.Artificial Intelligence227 (2015), 214–245. doi:10. 1016/j.artint.2015.06.005

  33. [33]

    E. L. Rissland and K. D. Ashley. 1987. A case-based system for trade secrets law. InProceedings of the first international conference on Artificial intelligence and law - ICAIL ’87. ACM Press, Boston, Massachusetts, United States, 60–66. doi:10.1145/41735.41743

  34. [34]

    Antonino Rotolo and Giovanni Sartor. 2023. Argumentation and explanation in the law.Frontiers in Artificial Intelligence6 (2023), 1130559. doi:10.3389/frai.2023. 1130559

  35. [35]

    Gordon, Matthias Grabmair, and Henry Prakken

    Giovanni Sartor, Michał Araszkiewicz, Katie Atkinson, Adam Wyner, Trevor Bench-Capon, Guido Governatori, Bart Verheij, Floris Bex, Thomas F. Gordon, Matthias Grabmair, and Henry Prakken. 2022. Thirty years of Artificial Intelli- gence and Law: the second decade.Artificial Intelligence and Law30, 4 (2022), 521–557

  36. [36]

    Savelka and K.D

    J. Savelka and K.D. Ashley. 2023. The unreasonable effectiveness of large language models in zero-shot semantic annotation of legal texts.Frontiers in Artificial Intelligence6 (2023)

  37. [37]

    Alex Schwartz. 2019. An agent-based model of judicial power.Journal of Law9, 1 (2019), 21–53

  38. [38]

    Alex Schwartz. 2020. Agent-Based Modeling for Legal Studies. InComputational Legal Studies: The Promise and Challenge of Data-Driven Research, Ryan Whalen (Ed.). Edward Elgar Publishing, Cheltenham, UK and Northampton, MA, USA, Chapter 14, 312–327. doi:10.4337/9781788977456.00019

  39. [39]

    Marco Siino, Mariana Falco, Daniele Croce, and Paolo Rosso. 2025. Exploring LLMs Applications in Law: A Literature Review on Current Legal NLP Ap- proaches.IEEE Access13 (2025), 18253–18276. doi:10.1109/ACCESS.2025.3533217

  40. [40]

    Steenhuis, B

    Q. Steenhuis, B. Willey, and D. Colarusso. 2023. Beyond Readability with Rate- MyPDF: A Combined Rule-based and Machine Learning Approach to Improving Court Forms. InProceedings of the Nineteenth International Conference on Artificial Intelligence and Law. 287–296

  41. [41]

    Cor Steging, Silja Renooij, and Bart Verheij. 2026. Parameterized Argumentation- based Reasoning Tasks for Benchmarking Generative Language Models. InPro- ceedings of the Twentieth International Conference on Artificial Intelligence and Law (ICAIL ’25). Association for Computing Machinery, New York, NY, USA, 455–459. doi:10.1145/3769126.3769230

  42. [42]

    Bart Verheij. 2016. Formalizing value-guided argumentation for ethical systems design.Artificial Intelligence and Law24, 4 (Dec. 2016), 387–407. doi:10.1007/ s10506-016-9189-y

  43. [43]

    Gordon, Matthias Grabmair, Henry Prakken, Giovanni Sartor, Bart Verheij, and Adam Wyner

    Serena Villata, Michał Araszkiewicz, Kevin Ashley, Katie Atkinson, Trevor Bench- Capon, Floris Bex, Guido Governatori, Thomas F. Gordon, Matthias Grabmair, Henry Prakken, Giovanni Sartor, Bart Verheij, and Adam Wyner. 2022. Thirty years of Artificial Intelligence and Law: the third decade.Artificial Intelligence and Law30, 4 (2022), 561–591

  44. [44]

    Li Zhang and Kevin D. Ashley. 2025. Mitigating Manipulation and Enhancing Persuasion: A Reflective Multi-Agent Approach for Legal Argument Generation. arXiv:2506.02992 [cs.AI] https://arxiv.org/abs/2506.02992

  45. [45]

    yes" and the defendant always argues

    Sebastian Zimmeck, Peter Story, Daniel Smullen, Abhilasha Ravichander, Ziqi Wang, Joel R Reidenberg, N Cameron Russell, and Norman Sadeh. 2019. Maps: Scaling privacy compliance analysis to a million apps.Proc. Priv. Enhancing Tech. 2019 (2019), 66. Investigating Multi-Agent Deliberation in Law AIDA2J, June 08, 2026, Singapore Appendix A Prompts This secti...

  46. [46]

    Identify which points are logically sound and relevant to the question

  47. [47]

    Identify which points are weak, irrelevant, or flawed

  48. [48]

    Determine whether the arguments reinforce or contradict each other

  49. [49]

    Treat the arguments as supplemental reasoning, not as new facts

    Use this analysis to refine your final decision. Treat the arguments as supplemental reasoning, not as new facts. Base your verdict primarily on the question. Explicitly reason step by step and conclude strictly with: Answer: yes or Answer: no (g) Prompt for the final decision by the judge. Figure 7: Prompts used in the 3-Ply system. Investigating Multi-A...

  50. [50]

    Give a summary of your reasoning

  51. [51]

    Restate your current stance clearly in the form: Answer: Yes or Answer: No

  52. [52]

    Continue: no

    On a new line, decide if you want to continue the conversation: Continue: yes (default — if there are still open questions or doubts OR if not all parrots have spoken yet.) Continue: no (if you are confident in your stance OR the parrots are only repeating points you have already addressed). If you choose "Continue: no", briefly explain why further discus...