pith. machine review for the scientific record.

arxiv: 2604.10470 · v1 · submitted 2026-04-12 · 💻 cs.CL · cs.AI

Recognition: unknown

From Query to Counsel: Structured Reasoning with a Multi-Agent Framework and Dataset for Legal Consultation

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:52 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords legal consultation · multi-agent framework · legal element graph · question answering · structured reasoning · Chinese legal AI · task decomposition · LLM evaluation

The pith

A multi-agent framework with legal element graphs outperforms general and legal LLMs on Chinese consultation QA tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles legal consultation question answering by building JurisCQAD, a dataset of more than 43,000 real-world Chinese queries paired with expert-validated responses. It decomposes each query into a legal element graph that links entities, events, intents, and issues. A modular multi-agent system called JurisMA then applies dynamic routing, statutory grounding, and stylistic refinement to produce responses. When trained on the new dataset and tested on a refined LawBench, the approach yields higher lexical and semantic scores than both general-purpose and domain-specific language models. The results indicate that interpretable decomposition and agent collaboration help manage the contextual dependencies typical in legal advice.

Core claim

Converting legal queries into legal element graphs that integrate entities, events, intents, and legal issues, then processing them via a modular multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization, produces more accurate and context-aware consultation responses than standard LLMs after training on JurisCQAD.

What carries the argument

The legal element graph: it integrates entities, events, intents, and legal issues to capture dependencies across facts, norms, and procedures, and it guides the multi-agent collaboration.
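As a concrete (hypothetical) illustration, the element graph described above could be represented roughly as follows. The node kinds, relation labels, and the sample query are assumptions for illustration, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Element:
    kind: str   # one of: "entity", "event", "intent", "issue" (assumed taxonomy)
    label: str  # surface form extracted from the query

@dataclass
class LegalElementGraph:
    elements: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (src_idx, relation, dst_idx)

    def add(self, kind, label):
        # Append a node and return its index for later linking.
        self.elements.append(Element(kind, label))
        return len(self.elements) - 1

    def link(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def issues(self):
        # The legal issues are what downstream agents would be routed on.
        return [e.label for e in self.elements if e.kind == "issue"]

# Invented example query: "My landlord kept my deposit after I moved out early."
g = LegalElementGraph()
tenant = g.add("entity", "tenant")
breach = g.add("event", "early lease termination")
intent = g.add("intent", "recover deposit")
issue = g.add("issue", "deposit refund obligation")
g.link(tenant, "party_to", breach)
g.link(breach, "raises", issue)
g.link(intent, "targets", issue)
```

The point of the structure is that intents and issues are explicit nodes rather than latent context, so dependencies across facts and norms become traversable edges.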

If this is right

  • Better handling of complex contextual dependencies in legal facts, norms, and procedures.
  • Higher performance on lexical and semantic evaluation metrics for legal consultation.
  • More interpretable reasoning through explicit decomposition and modular agent steps.
  • Improved statutory grounding and response style via specialized routing and optimization.
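A minimal sketch of what dynamic routing with statutory grounding and stylistic refinement might look like in code. The agent functions, routing rule, and statute table here are invented for illustration; the paper's actual routing policy is not specified in this review:

```python
# Hypothetical statute lookup table; real systems would query a legal database.
STATUTES = {
    "deposit refund obligation": "Contract Law Art. 107 (hypothetical citation)",
}

def statutory_grounding_agent(issue):
    # Attach a supporting statute to a substantive legal issue.
    statute = STATUTES.get(issue, "no statute retrieved")
    return f"[{issue}] grounded in: {statute}"

def procedural_agent(issue):
    # Fallback specialist for procedural questions (deadlines, filings).
    return f"[{issue}] procedural guidance drafted"

def route(issue):
    # Dynamic routing: send issues with statutory coverage to the
    # grounding agent, everything else to the procedural specialist.
    if issue in STATUTES:
        return statutory_grounding_agent(issue)
    return procedural_agent(issue)

def stylistic_refiner(drafts):
    # Stylistic optimization pass: merge agent drafts into one response.
    return " ".join(drafts)

drafts = [route(i) for i in ["deposit refund obligation", "filing deadline"]]
answer = stylistic_refiner(drafts)
```

The design choice worth noting: routing on explicit issue nodes keeps each agent's responsibility narrow, which is what makes the pipeline's intermediate steps inspectable.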

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph decomposition approach could be tested on non-Chinese legal systems or adjacent domains such as medical consultation.
  • Connecting the agents to live statutory databases would likely strengthen the grounding component further.
  • The dataset construction method could be reused to create similar resources for other languages or legal traditions.

Load-bearing premise

Expert-validated positive and negative responses accurately represent high-quality legal advice, and the graph decomposition plus multi-agent steps capture all relevant dependencies without introducing new errors.

What would settle it

Evaluation on a held-out set of legal queries where the multi-agent system shows no gain in semantic metrics or produces more factual inaccuracies than a single legal-domain LLM.

Figures

Figures reproduced from arXiv: 2604.10470 by Mengjia Wu, Mingfei Lu, Yi Zhang, Yue Feng.

Figure 1
Figure 1: An illustrative example of legal consultation.
Figure 2
Figure 2: Overview of JurisMA, a multi-agent framework that parses legal queries into element graphs, refines drafts via agent collaboration, and outputs a final legal opinion with supporting statutes.
Figure 3
Figure 3: ROUGE-L and BERTScore comparison before and after DPO across Qwen2.5 models (3B/7B/14B).
Figure 4
Figure 4: Case study comparing model-generated responses to a time-sensitive legal query.
Original abstract

Legal consultation question answering (Legal CQA) presents unique challenges compared to traditional legal QA tasks, including the scarcity of high-quality training data, complex task composition, and strong contextual dependencies. To address these, we construct JurisCQAD, a large-scale dataset of over 43,000 real-world Chinese legal queries annotated with expert-validated positive and negative responses, and design a structured task decomposition that converts each query into a legal element graph integrating entities, events, intents, and legal issues. We further propose JurisMA, a modular multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization. Combined with the element graph, the framework enables strong context-aware reasoning, effectively capturing dependencies across legal facts, norms, and procedural logic. Trained on JurisCQAD and evaluated on a refined LawBench, our system significantly outperforms both general-purpose and legal-domain LLMs across multiple lexical and semantic metrics, demonstrating the benefits of interpretable decomposition and modular collaboration in Legal CQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces JurisCQAD, a large-scale dataset of over 43,000 real-world Chinese legal queries annotated with expert-validated positive and negative responses. It proposes a legal element graph that decomposes each query into entities, events, intents, and legal issues, along with the JurisMA multi-agent framework supporting dynamic routing, statutory grounding, and stylistic optimization. Trained on JurisCQAD and evaluated on a refined LawBench, the system is claimed to significantly outperform both general-purpose and legal-domain LLMs on lexical and semantic metrics, illustrating the benefits of interpretable decomposition and modular collaboration for Legal CQA.

Significance. If the empirical claims hold under rigorous scrutiny, the work provides a substantial new resource in JurisCQAD for legal consultation QA and demonstrates how graph-based decomposition combined with multi-agent collaboration can address complex contextual dependencies in a high-stakes domain. The emphasis on modularity and interpretability is a strength that could inform future domain-specific systems, particularly where factual grounding and procedural logic matter.

major comments (3)
  1. [§3] §3 (Dataset Construction): The expert validation of the 43k positive/negative response pairs is presented without inter-annotator agreement figures, annotation guidelines, or disagreement-resolution procedures. This detail is load-bearing for the claim that JurisCQAD supplies reliable high-quality supervision.
  2. [§5–6] §5–6 (Framework and Experiments): No ablation studies isolate the contribution of the legal element graph construction from the multi-agent routing components, and no error analysis examines whether the graph decomposition introduces new factual or procedural errors. Without these, the attribution of metric gains to the proposed methods remains unverified.
  3. [§6] §6 (Evaluation): The results section must report concrete numerical values, baseline details, statistical significance tests, and exclusion criteria for the claimed outperformance on the refined LawBench; the abstract alone supplies none of these.
minor comments (2)
  1. [§4] A formal diagram or pseudocode for the legal element graph construction would improve clarity in §4.
  2. [Throughout] Ensure consistency in terminology between 'legal element graph' and 'JurisMA' across sections and figures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and will revise the paper to address the concerns regarding dataset documentation, experimental ablations, error analysis, and result reporting. Below we respond point by point.

Point-by-point responses
  1. Referee: [§3] §3 (Dataset Construction): The expert validation of the 43k positive/negative response pairs is presented without inter-annotator agreement figures, annotation guidelines, or disagreement-resolution procedures. This detail is load-bearing for the claim that JurisCQAD supplies reliable high-quality supervision.

    Authors: We agree that explicit documentation of the annotation process is necessary to support claims of dataset quality. In the revised manuscript we will add a new subsection to §3 that (i) reproduces the annotation guidelines given to legal experts, (ii) describes the disagreement-resolution protocol (two-expert review followed by a senior adjudicator), and (iii) reports inter-annotator agreement statistics (Cohen's κ and raw agreement) computed on a 5% stratified sample of the 43k pairs. These additions will be placed before the dataset statistics table. revision: yes
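The agreement statistic the authors commit to can be computed with a short, self-contained routine. The annotator labels below are invented toy data, not from JurisCQAD:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items where the annotators match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independent labeling with each
    # annotator's marginal label distribution.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy positive/negative response judgments from two hypothetical annotators.
ann1 = ["pos", "pos", "neg", "pos", "neg", "neg"]
ann2 = ["pos", "neg", "neg", "pos", "neg", "pos"]
kappa = cohens_kappa(ann1, ann2)  # → 1/3 for this toy data
```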

  2. Referee: [§5–6] §5–6 (Framework and Experiments): No ablation studies isolate the contribution of the legal element graph construction from the multi-agent routing components, and no error analysis examines whether the graph decomposition introduces new factual or procedural errors. Without these, the attribution of metric gains to the proposed methods remains unverified.

    Authors: We accept that the current experimental design does not fully disentangle the contributions of the legal element graph and the multi-agent routing. We will add two new ablation tables in §5: one that removes the graph construction module while keeping the agents, and another that disables dynamic routing while retaining the graph. In addition, we will insert an error-analysis subsection that manually inspects 200 failure cases, categorizes errors attributable to graph decomposition (factual hallucination, missed legal issue, incorrect event linking), and quantifies their downstream effect on final answer quality. These results will be reported alongside the main experiments in the revised §6. revision: yes

  3. Referee: [§6] §6 (Evaluation): The results section must report concrete numerical values, baseline details, statistical significance tests, and exclusion criteria for the claimed outperformance on the refined LawBench; the abstract alone supplies none of these.

    Authors: We acknowledge that the present version of §6 and the abstract lack the required quantitative detail. The revised manuscript will expand §6 with (i) full numerical tables for all lexical and semantic metrics on the refined LawBench, (ii) explicit baseline specifications (model names, parameter counts, fine-tuning regimes), (iii) statistical significance results (paired t-tests and McNemar’s test with p-values and confidence intervals), and (iv) a clear statement of exclusion criteria applied to the test set. The abstract will be updated to cite the key absolute improvements (e.g., +X BLEU, +Y F1). revision: yes
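The significance testing the authors promise can be sketched with a paired bootstrap test on per-example metric scores, a common alternative to the paired t-test in NLP evaluation. The score arrays below are invented toy numbers, not results from the paper:

```python
import random

def paired_bootstrap_p(scores_a, scores_b, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap p-value: how often system A's mean
    advantage over B disappears when test items are resampled."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    losses = 0
    for _ in range(n_resamples):
        # Resample per-item score differences with replacement.
        sample = [rng.choice(diffs) for _ in diffs]
        if sum(sample) / len(sample) <= 0:
            losses += 1
    return losses / n_resamples

# Toy per-example metric scores for two systems (invented numbers).
a = [0.62, 0.71, 0.55, 0.68, 0.74, 0.60, 0.66, 0.70]
b = [0.58, 0.69, 0.57, 0.61, 0.70, 0.59, 0.60, 0.65]
p = paired_bootstrap_p(a, b)
```

Pairing by test item matters here: both systems are scored on the same queries, so resampling items jointly controls for query difficulty in a way an unpaired test would not.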

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and framework evaluation

Full rationale

The paper's core contribution is the construction of JurisCQAD (43k expert-annotated queries) and the JurisMA multi-agent framework with legal element graph decomposition, followed by empirical training and evaluation on refined LawBench showing metric gains over baselines. No equations, first-principles derivations, or predictions are claimed; performance claims rest on direct measurement rather than any reduction to fitted inputs or self-citations. The work is self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Central claims rest on the reliability of expert annotations and the utility of the proposed decomposition and agent collaboration; no explicit free parameters are stated in the abstract.

axioms (2)
  • domain assumption Expert-validated positive and negative responses provide reliable training and evaluation signals for legal consultation quality.
    Dataset construction and performance claims depend directly on this.
  • domain assumption Converting queries into a legal element graph of entities, events, intents, and issues captures the necessary contextual dependencies.
    This decomposition is presented as the foundation for the multi-agent reasoning.
invented entities (2)
  • Legal element graph no independent evidence
    purpose: Integrates entities, events, intents, and legal issues to enable context-aware reasoning.
    Core representational innovation described in the abstract.
  • JurisMA multi-agent framework no independent evidence
    purpose: Supports dynamic routing, statutory grounding, and stylistic optimization via modular collaboration.
    Proposed system architecture.

pith-pipeline@v0.9.0 · 5474 in / 1457 out tokens · 45034 ms · 2026-05-10T15:52:08.213500+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...

Reference graph

Works this paper leans on

4 extracted references · 3 canonical work pages · cited by 1 Pith paper
