pith. machine review for the scientific record.

arxiv: 2604.10470 · v1 · submitted 2026-04-12 · 💻 cs.CL · cs.AI

Recognition: unknown

From Query to Counsel: Structured Reasoning with a Multi-Agent Framework and Dataset for Legal Consultation

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:52 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords legal consultation · multi-agent framework · legal element graph · question answering · structured reasoning · Chinese legal AI · task decomposition · LLM evaluation

The pith

A multi-agent framework with legal element graphs outperforms general and legal LLMs on Chinese consultation QA tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles legal consultation question answering by building JurisCQAD, a dataset of more than 43,000 real-world Chinese queries paired with expert-validated responses. It decomposes each query into a legal element graph that links entities, events, intents, and issues. A modular multi-agent system called JurisMA then applies dynamic routing, statutory grounding, and stylistic refinement to produce responses. When trained on the new dataset and tested on a refined LawBench, the approach yields higher lexical and semantic scores than both general-purpose and domain-specific language models. The results indicate that interpretable decomposition and agent collaboration help manage the contextual dependencies typical in legal advice.

Core claim

Converting legal queries into legal element graphs that integrate entities, events, intents, and legal issues, then processing them via a modular multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization, produces more accurate and context-aware consultation responses than standard LLMs after training on JurisCQAD.

What carries the argument

The legal element graph: it integrates entities, events, intents, and legal issues to capture dependencies across facts, norms, and procedures, and it guides the multi-agent collaboration.
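As a concrete (hypothetical) illustration, the element graph described above could be represented roughly as follows. The node kinds, relation labels, and the sample query are assumptions for illustration, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Element:
    kind: str   # one of: "entity", "event", "intent", "issue" (assumed taxonomy)
    label: str  # surface form extracted from the query

@dataclass
class LegalElementGraph:
    elements: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (src_idx, relation, dst_idx)

    def add(self, kind, label):
        # Append a node and return its index for later linking.
        self.elements.append(Element(kind, label))
        return len(self.elements) - 1

    def link(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def issues(self):
        # The legal issues are what downstream agents would be routed on.
        return [e.label for e in self.elements if e.kind == "issue"]

# Invented example query: "My landlord kept my deposit after I moved out early."
g = LegalElementGraph()
tenant = g.add("entity", "tenant")
breach = g.add("event", "early lease termination")
intent = g.add("intent", "recover deposit")
issue = g.add("issue", "deposit refund obligation")
g.link(tenant, "party_to", breach)
g.link(breach, "raises", issue)
g.link(intent, "targets", issue)
```

The point of the structure is that intents and issues are explicit nodes rather than latent context, so dependencies across facts and norms become traversable edges.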

If this is right

  • Better handling of complex contextual dependencies in legal facts, norms, and procedures.
  • Higher performance on lexical and semantic evaluation metrics for legal consultation.
  • More interpretable reasoning through explicit decomposition and modular agent steps.
  • Improved statutory grounding and response style via specialized routing and optimization.
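A minimal sketch of what dynamic routing with statutory grounding and stylistic refinement might look like in code. The agent functions, routing rule, and statute table here are invented for illustration; the paper's actual routing policy is not specified in this review:

```python
# Hypothetical statute lookup table; real systems would query a legal database.
STATUTES = {
    "deposit refund obligation": "Contract Law Art. 107 (hypothetical citation)",
}

def statutory_grounding_agent(issue):
    # Attach a supporting statute to a substantive legal issue.
    statute = STATUTES.get(issue, "no statute retrieved")
    return f"[{issue}] grounded in: {statute}"

def procedural_agent(issue):
    # Fallback specialist for procedural questions (deadlines, filings).
    return f"[{issue}] procedural guidance drafted"

def route(issue):
    # Dynamic routing: send issues with statutory coverage to the
    # grounding agent, everything else to the procedural specialist.
    if issue in STATUTES:
        return statutory_grounding_agent(issue)
    return procedural_agent(issue)

def stylistic_refiner(drafts):
    # Stylistic optimization pass: merge agent drafts into one response.
    return " ".join(drafts)

drafts = [route(i) for i in ["deposit refund obligation", "filing deadline"]]
answer = stylistic_refiner(drafts)
```

The design choice worth noting: routing on explicit issue nodes keeps each agent's responsibility narrow, which is what makes the pipeline's intermediate steps inspectable.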

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph decomposition approach could be tested on non-Chinese legal systems or adjacent domains such as medical consultation.
  • Connecting the agents to live statutory databases would likely strengthen the grounding component further.
  • The dataset construction method could be reused to create similar resources for other languages or legal traditions.

Load-bearing premise

Expert-validated positive and negative responses accurately represent high-quality legal advice, and the graph decomposition plus multi-agent steps capture all relevant dependencies without introducing new errors.

What would settle it

Evaluation on a held-out set of legal queries where the multi-agent system shows no gain in semantic metrics or produces more factual inaccuracies than a single legal-domain LLM.

Figures

Figures reproduced from arXiv: 2604.10470 by Mengjia Wu, Mingfei Lu, Yi Zhang, Yue Feng.

Figure 1
Figure 1: An illustrative example of legal consultation.
Figure 2
Figure 2: Overview of JurisMA, a multi-agent framework that parses legal queries into element graphs, refines drafts via agent collaboration, and outputs a final legal opinion with supporting statutes.
Figure 3
Figure 3: ROUGE-L and BERTScore comparison before and after DPO across Qwen2.5 models (3B/7B/14B).
Figure 4
Figure 4: Case study comparing model-generated responses to a time-sensitive legal query.
Original abstract

Legal consultation question answering (Legal CQA) presents unique challenges compared to traditional legal QA tasks, including the scarcity of high-quality training data, complex task composition, and strong contextual dependencies. To address these, we construct JurisCQAD, a large-scale dataset of over 43,000 real-world Chinese legal queries annotated with expert-validated positive and negative responses, and design a structured task decomposition that converts each query into a legal element graph integrating entities, events, intents, and legal issues. We further propose JurisMA, a modular multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization. Combined with the element graph, the framework enables strong context-aware reasoning, effectively capturing dependencies across legal facts, norms, and procedural logic. Trained on JurisCQAD and evaluated on a refined LawBench, our system significantly outperforms both general-purpose and legal-domain LLMs across multiple lexical and semantic metrics, demonstrating the benefits of interpretable decomposition and modular collaboration in Legal CQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces JurisCQAD, a large-scale dataset of over 43,000 real-world Chinese legal queries annotated with expert-validated positive and negative responses. It proposes a legal element graph that decomposes each query into entities, events, intents, and legal issues, along with the JurisMA multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization. Trained on JurisCQAD and evaluated on a refined LawBench, the system is claimed to significantly outperform both general-purpose and legal-domain LLMs on lexical and semantic metrics, illustrating the benefits of interpretable decomposition and modular collaboration for Legal CQA.

Significance. If the empirical claims hold under rigorous scrutiny, the work provides a substantial new resource in JurisCQAD for legal consultation QA and demonstrates how graph-based decomposition combined with multi-agent collaboration can address complex contextual dependencies in a high-stakes domain. The emphasis on modularity and interpretability is a strength that could inform future domain-specific systems, particularly where factual grounding and procedural logic matter.

major comments (3)
  1. [§3] §3 (Dataset Construction): The expert validation of the 43k positive/negative response pairs is presented without inter-annotator agreement figures, annotation guidelines, or disagreement-resolution procedures. This detail is load-bearing for the claim that JurisCQAD supplies reliable high-quality supervision.
  2. [§5–6] §5–6 (Framework and Experiments): No ablation studies isolate the contribution of the legal element graph construction from the multi-agent routing components, and no error analysis examines whether the graph decomposition introduces new factual or procedural errors. Without these, the attribution of metric gains to the proposed methods remains unverified.
  3. [§6] §6 (Evaluation): The results section must report concrete numerical values, baseline details, statistical significance tests, and exclusion criteria for the claimed outperformance on the refined LawBench; the abstract alone supplies none of these.
minor comments (2)
  1. [§4] A formal diagram or pseudocode for the legal element graph construction would improve clarity in §4.
  2. [Throughout] Ensure consistency in terminology between 'legal element graph' and 'JurisMA' across sections and figures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and will revise the paper to address the concerns regarding dataset documentation, experimental ablations, error analysis, and result reporting. Below we respond point by point.

Point-by-point responses
  1. Referee: [§3] §3 (Dataset Construction): The expert validation of the 43k positive/negative response pairs is presented without inter-annotator agreement figures, annotation guidelines, or disagreement-resolution procedures. This detail is load-bearing for the claim that JurisCQAD supplies reliable high-quality supervision.

    Authors: We agree that explicit documentation of the annotation process is necessary to support claims of dataset quality. In the revised manuscript we will add a new subsection to §3 that (i) reproduces the annotation guidelines given to legal experts, (ii) describes the disagreement-resolution protocol (two-expert review followed by a senior adjudicator), and (iii) reports inter-annotator agreement statistics (Cohen's κ and raw agreement) computed on a 5% stratified sample of the 43k pairs. These additions will be placed before the dataset statistics table. revision: yes
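The agreement statistic the authors commit to can be computed with a short, self-contained routine. The annotator labels below are invented toy data, not from JurisCQAD:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items where the annotators match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independent labeling with each
    # annotator's marginal label distribution.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy positive/negative response judgments from two hypothetical annotators.
ann1 = ["pos", "pos", "neg", "pos", "neg", "neg"]
ann2 = ["pos", "neg", "neg", "pos", "neg", "pos"]
kappa = cohens_kappa(ann1, ann2)  # → 1/3 for this toy data
```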

  2. Referee: [§5–6] §5–6 (Framework and Experiments): No ablation studies isolate the contribution of the legal element graph construction from the multi-agent routing components, and no error analysis examines whether the graph decomposition introduces new factual or procedural errors. Without these, the attribution of metric gains to the proposed methods remains unverified.

    Authors: We accept that the current experimental design does not fully disentangle the contributions of the legal element graph and the multi-agent routing. We will add two new ablation tables in §5: one that removes the graph construction module while keeping the agents, and another that disables dynamic routing while retaining the graph. In addition, we will insert an error-analysis subsection that manually inspects 200 failure cases, categorizes errors attributable to graph decomposition (factual hallucination, missed legal issue, incorrect event linking), and quantifies their downstream effect on final answer quality. These results will be reported alongside the main experiments in the revised §6. revision: yes

  3. Referee: [§6] §6 (Evaluation): The results section must report concrete numerical values, baseline details, statistical significance tests, and exclusion criteria for the claimed outperformance on the refined LawBench; the abstract alone supplies none of these.

    Authors: We acknowledge that the present version of §6 and the abstract lack the required quantitative detail. The revised manuscript will expand §6 with (i) full numerical tables for all lexical and semantic metrics on the refined LawBench, (ii) explicit baseline specifications (model names, parameter counts, fine-tuning regimes), (iii) statistical significance results (paired t-tests and McNemar’s test with p-values and confidence intervals), and (iv) a clear statement of exclusion criteria applied to the test set. The abstract will be updated to cite the key absolute improvements (e.g., +X BLEU, +Y F1). revision: yes
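The significance testing the authors promise can be sketched with a paired bootstrap test on per-example metric scores, a common alternative to the paired t-test in NLP evaluation. The score arrays below are invented toy numbers, not results from the paper:

```python
import random

def paired_bootstrap_p(scores_a, scores_b, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap p-value: how often system A's mean
    advantage over B disappears when test items are resampled."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    losses = 0
    for _ in range(n_resamples):
        # Resample per-item score differences with replacement.
        sample = [rng.choice(diffs) for _ in diffs]
        if sum(sample) / len(sample) <= 0:
            losses += 1
    return losses / n_resamples

# Toy per-example metric scores for two systems (invented numbers).
a = [0.62, 0.71, 0.55, 0.68, 0.74, 0.60, 0.66, 0.70]
b = [0.58, 0.69, 0.57, 0.61, 0.70, 0.59, 0.60, 0.65]
p = paired_bootstrap_p(a, b)
```

Pairing by test item matters here: both systems are scored on the same queries, so resampling items jointly controls for query difficulty in a way an unpaired test would not.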

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and framework evaluation

Full rationale

The paper's core contribution is the construction of JurisCQAD (43k expert-annotated queries) and the JurisMA multi-agent framework with legal element graph decomposition, followed by empirical training and evaluation on refined LawBench showing metric gains over baselines. No equations, first-principles derivations, or predictions are claimed; performance claims rest on direct measurement rather than any reduction to fitted inputs or self-citations. The work is self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Central claims rest on the reliability of expert annotations and the utility of the proposed decomposition and agent collaboration; no explicit free parameters are stated in the abstract.

axioms (2)
  • domain assumption Expert-validated positive and negative responses provide reliable training and evaluation signals for legal consultation quality.
    Dataset construction and performance claims depend directly on this.
  • domain assumption Converting queries into a legal element graph of entities, events, intents, and issues captures the necessary contextual dependencies.
    This decomposition is presented as the foundation for the multi-agent reasoning.
invented entities (2)
  • Legal element graph no independent evidence
    purpose: Integrates entities, events, intents, and legal issues to enable context-aware reasoning.
    Core representational innovation described in the abstract.
  • JurisMA multi-agent framework no independent evidence
    purpose: Supports dynamic routing, statutory grounding, and stylistic optimization via modular collaboration.
    Proposed system architecture.

pith-pipeline@v0.9.0 · 5474 in / 1457 out tokens · 45034 ms · 2026-05-10T15:52:08.213500+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...

Reference graph

Works this paper leans on

4 extracted references · 3 canonical work pages · cited by 1 Pith paper
