LegalDrill: Diagnosis-Driven Synthesis for Legal Reasoning in Small Language Models
Pith reviewed 2026-05-08 06:02 UTC · model grok-4.3
The pith
LegalDrill extracts and verifies reasoning trajectories from a teacher model to train small language models for legal tasks without expert annotations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LegalDrill is a diagnosis-driven synthesis framework that extracts reasoning trajectories from a capable teacher via fine-grained prompting, applies self-reflective verification to select the most effective data, and uses the resulting dataset to train small language models through supervised fine-tuning and direct preference optimization, yielding stronger performance on legal benchmarks without expert annotations.
What carries the argument
Diagnosis-driven synthesis framework that extracts and iteratively refines reasoning trajectories from a teacher model through targeted prompting and adaptive verification to curate training data for the student model.
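The page names the stages without showing their mechanics, so the following is a minimal sketch of how such a diagnosis-driven loop could be wired together. Every function name here (generate, diagnose, refine, score) is a hypothetical placeholder inferred from the abstract, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Diagnosis:
    errors: list = field(default_factory=list)  # granular error descriptions

def synthesize(problems, generate, diagnose, refine, score,
               threshold=0.8, rounds=3):
    """Diagnosis-driven synthesis loop (hypothetical reconstruction).

    generate/diagnose/refine/score are caller-supplied wrappers around
    teacher-model calls; their behavior is assumed, not taken from the paper.
    """
    accepted = []
    for p in problems:
        traj = generate(p)               # fine-grained prompting of the teacher
        for _ in range(rounds):
            d = diagnose(p, traj)        # identify granular reasoning errors
            if not d.errors:
                break
            traj = refine(p, traj, d)    # iteratively refine against the diagnosis
        if score(p, traj) >= threshold:  # self-reflective verification gate
            accepted.append((p, traj))
    return accepted                      # SFT data; DPO pairs are built downstream
```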
If this is right
- Representative small language models achieve higher accuracy on multiple legal reasoning benchmarks.
- Training proceeds without dependence on scarce expert-annotated reasoning paths.
- The combination of supervised fine-tuning and direct preference optimization produces more coherent legal deductions (a sketch of the DPO objective follows this list).
- The method offers a scalable alternative to manual data curation or coarse rejection sampling.
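The third point leans on the SFT-plus-DPO training stage. The DPO objective itself is standard and published, so a worked form may help; the code below implements that general loss on per-sequence log-probabilities, with invented toy numbers, and is not specific to LegalDrill.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective on per-sequence summed log-probabilities."""
    policy_margin = logp_chosen - logp_rejected  # policy's preference for chosen
    ref_margin = ref_chosen - ref_rejected       # same margin under frozen reference
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy check with invented log-probs: the loss falls as the policy's
# preference for the verified trajectory grows past the reference's.
lc, lr = torch.tensor([-5.0]), torch.tensor([-9.0])   # policy log-probs
rc, rr = torch.tensor([-7.0]), torch.tensor([-7.5])   # reference log-probs
print(dpo_loss(lc, lr, rc, rr))  # ≈ 0.53
```

In practice the four inputs are obtained by summing token-level log-probabilities of each trajectory under the trained policy and a frozen reference model.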
Where Pith is reading between the lines
- The same prompting-plus-verification loop could be tested on other structured reasoning domains such as medical diagnosis or scientific hypothesis generation.
- Repeated cycles of diagnosis and selection might compound data quality gains beyond a single pass.
- Trained small models could enable lower-cost deployment in legal technology tools where large models are currently required.
Load-bearing premise
Fine-grained prompting combined with self-reflective verification can reliably produce logically consistent and high-quality reasoning trajectories that outperform standard methods without introducing harmful teacher-model biases.
What would settle it
An experiment in which small models trained on LegalDrill data show no meaningful gains or outright underperform compared with models trained on standard rejection-sampled trajectories or human-annotated legal data across the same benchmarks.
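Settling it this way amounts to a paired comparison on identical benchmark items, for which a per-item exact test is the natural instrument. The sketch below is generic McNemar-style machinery under that assumption, with invented example data; it is not an analysis the paper reports.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Exact McNemar test on paired per-item correctness (0/1 lists).

    b = items only system A got right, c = items only system B got right.
    Returns the two-sided p-value for H0: the systems are equally accurate.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Invented example: LegalDrill-trained vs rejection-sampling-trained student.
legaldrill = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
rejection  = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1]
print(mcnemar_exact(legaldrill, rejection))  # p = 0.25 at this toy scale
```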
Original abstract
Small language models (SLMs) are promising for real-world deployment due to their efficiency and low operational cost. However, their limited capacity struggles with high-stakes legal reasoning tasks that require coherent statute interpretation and logically consistent deduction. Furthermore, training SLMs for such tasks demands high-quality, concise reasoning trajectories, which are prohibitively expensive to manually collect and difficult to curate via standard rejection sampling, lacking granularity beyond final verdicts. To address these challenges, we propose LegalDrill, a diagnosis-driven synthesis framework that extracts and iteratively refines reasoning trajectories from a capable teacher via fine-grained prompting, then a self-reflective verification is employed to adaptively select the most effective data for the SLM student. The resulting data empower SLM training through supervised fine-tuning and direct preference optimization. Extensive experiments on several legal benchmarks demonstrate that LegalDrill significantly bolsters the legal reasoning capabilities of representative SLMs while bypassing the need for scarce expert annotations, paving a scalable path toward practical legal reasoning systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LegalDrill, a diagnosis-driven synthesis framework for generating high-quality reasoning trajectories to train small language models (SLMs) on legal reasoning tasks. It extracts trajectories from a teacher model via fine-grained prompting, applies self-reflective verification to select effective data, and uses the resulting dataset for supervised fine-tuning (SFT) followed by direct preference optimization (DPO). The central claim is that this approach significantly improves SLM performance on legal benchmarks while avoiding the need for scarce expert annotations.
Significance. If the empirical results hold with proper controls and baselines, the work could provide a practical path for deploying efficient, domain-specialized SLMs in legal applications where large models are costly and expert data is limited. The emphasis on iterative diagnosis and self-reflection for synthetic trajectory curation is a targeted contribution to data synthesis methods in high-stakes reasoning domains.
major comments (2)
- [Abstract] The claim that 'extensive experiments on several legal benchmarks demonstrate that LegalDrill significantly bolsters the legal reasoning capabilities of representative SLMs' is presented without any quantitative metrics, baseline comparisons, error bars, statistical tests, or experimental details. This is load-bearing for the central empirical claim, as the abstract supplies no evidence to evaluate whether the improvements are real, substantial, or reliable.
- [Method] Self-reflective verification: The framework description does not specify what constitutes a 'diagnosis,' how the reflection step scores logical consistency or statute-interpretation accuracy, or whether verification uses an independent judge versus the teacher model itself. Without these details it is impossible to assess whether the selected trajectories are free of teacher biases or hallucinations, which directly affects the claim that the method is superior to standard rejection sampling.
minor comments (1)
- [Abstract] The repeated use of curly braces around LegalDrill in the abstract appears to be an unrendered LaTeX command and should be corrected for readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below and will revise the manuscript to improve clarity and evidence presentation.
Point-by-point responses
Referee: [Abstract] The claim that 'extensive experiments on several legal benchmarks demonstrate that LegalDrill significantly bolsters the legal reasoning capabilities of representative SLMs' is presented without any quantitative metrics, baseline comparisons, error bars, statistical tests, or experimental details. This is load-bearing for the central empirical claim, as the abstract supplies no evidence to evaluate whether the improvements are real, substantial, or reliable.
Authors: We agree that the abstract would benefit from including select quantitative highlights to support the central claim. In the revised version, we will add concise metrics (e.g., average accuracy gains over baselines across the legal benchmarks) and a brief note that full results with error bars, statistical tests, and baseline comparisons appear in Section 4. This keeps the abstract focused while providing immediate evidence of the improvements.
revision: yes
Referee: [Method] Self-reflective verification: The framework description does not specify what constitutes a 'diagnosis,' how the reflection step scores logical consistency or statute-interpretation accuracy, or whether verification uses an independent judge versus the teacher model itself. Without these details it is impossible to assess whether the selected trajectories are free of teacher biases or hallucinations, which directly affects the claim that the method is superior to standard rejection sampling.
Authors: We acknowledge the need for greater specificity on the self-reflective verification component. We will expand the method section to define a 'diagnosis' as the teacher model's identification of granular errors (e.g., logical inconsistencies or statute misinterpretations) via fine-grained prompts. The reflection step scores trajectories using a rubric for logical consistency and accuracy, performed by the teacher model itself in a self-reflective loop rather than by an independent judge. We will include pseudocode, an example trajectory, and an explicit comparison to standard rejection sampling (which evaluates only final outputs) to show how the approach reduces biases and hallucinations. Our existing experiments already demonstrate superiority over rejection-sampling baselines on the legal tasks.
revision: yes
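Pending the promised pseudocode, one plausible reading of the loop the authors describe is sketched below; the rubric wording, prompt text, and acceptance threshold are assumptions, not the authors' artifacts.

```python
# Hypothetical reconstruction of the rebuttal's verification loop: the teacher
# both diagnoses and scores its own trajectories. Rubric dimensions, prompts,
# and the threshold are assumed, not taken from the paper.

RUBRIC = ("Score 1-5 for (a) logical consistency of each step, "
          "(b) faithfulness of statute interpretation, and "
          "(c) agreement with the ground truth. Return one integer: the minimum.")

def verify(teacher_call, problem, trajectory, ground_truth, min_score=4):
    # Step 1: diagnosis — the teacher identifies granular errors.
    diagnosis = teacher_call(
        f"Diagnose errors in this reasoning.\nProblem: {problem}\n"
        f"Trajectory: {trajectory}\nGround truth: {ground_truth}")
    # Step 2: self-reflective rubric scoring by the same teacher model.
    reply = teacher_call(
        f"{RUBRIC}\nDiagnosis: {diagnosis}\nTrajectory: {trajectory}")
    score = int(reply.strip())  # assumes the teacher returns a bare integer
    # Step 3: adaptive selection — keep only trajectories clearing the bar.
    return score >= min_score
```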
Circularity Check
No significant circularity; claims rest on empirical validation
full rationale
The paper describes an empirical synthesis framework (fine-grained prompting + self-reflective verification + SFT/DPO) whose central claims are evaluated via experiments on external legal benchmarks. No equations, predictions, or first-principles derivations appear that reduce by construction to fitted inputs or self-citations. The method's effectiveness is not presupposed by its own definitions but is presented as a testable outcome of the proposed pipeline.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: A capable teacher model can generate logically consistent legal reasoning trajectories via fine-grained prompting.
- domain assumption: Self-reflective verification can adaptively select higher-quality data than standard rejection sampling.
Prompt excerpts
Fragments of the paper's appendix prompts for trajectory diagnosis, flawed-sample generation, and judging.
- Diagnose: evaluate the correctness and reasoning of the student's answer against the ground truth.
- Internal evaluation process: verify whether the student's final answer matches the ground truth; check the reasoning for logical soundness (e.g., missed conditions, hallucinations); classify any flaws using the provided error taxonomy.
- Instruction generation: if an error exists, provide a specific, abstract instruction on how to reproduce this logic error in a completely different legal context, then draft a reproduction_instruction for the teacher AI.
- Reproduction instruction guidelines: the instruction must be context-agnostic and actionable, mentioning no specific entities or clauses. Example (good): "Identify a condition in the text that limits a right, and generate a response that treats the right as absolute by deliberately ignoring that condition."
- Flaw embodiment: generate a step-by-step reasoning process that naturally embodies the specified error (e.g., ignoring a condition) without explicitly stating "I am making an error," and that plausibly leads to the opposite of the ground truth; do not mention the error summary or the ground truth in the output.
- Immediate reasoning: start the response immediately with the flawed step-by-step reasoning; do not include any preamble or repetition of instructions.
- Twist the logic: plausibly embody the flaws listed in the error types; follow the reproduction instruction to manipulate the logic (e.g., if instructed to ignore a condition, simply fail to mention it).
- Opposite conclusion and formatting: the reasoning must naturally lead to a final answer that is the opposite of the correct answer, adhere strictly to the provided output structure, and conclude on a new line with "Final Answer: Yes" or "Final Answer: No". When generating the chosen sample, the prompt is similar except highlighting that the…
- Judge system prompt (appendix D.4): "You are a strict legal reasoning judge." Field definitions: question = the claim/question to evaluate; contract = the governing legal text/context only; ground_truth = the gold final answer for the question under the contra…
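Read together, these excerpts suggest that preference pairs for DPO are built from a verified trajectory (chosen) and a deliberately flaw-embodied one (rejected). A minimal sketch of that pairing, with invented field and function names, might look like this.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str    # legal question plus governing text
    chosen: str    # verified trajectory ending "Final Answer: Yes/No"
    rejected: str  # flaw-embodied trajectory reaching the opposite conclusion

def build_pair(problem, verified_traj, teacher_call, reproduction_instruction):
    """Build one DPO pair per the excerpted protocol (names are invented).

    The rejected sample is regenerated under a context-agnostic reproduction
    instruction, with no preamble and a conclusion opposite the ground truth.
    """
    rejected = teacher_call(
        f"{reproduction_instruction}\nProblem: {problem}\n"
        "Start immediately with the flawed step-by-step reasoning. "
        "Conclude strictly with 'Final Answer: Yes' or 'Final Answer: No'.")
    return PreferencePair(prompt=problem, chosen=verified_traj, rejected=rejected)
```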