LegalMidm: Use-Case-Driven Legal Domain Specialization for Korean Large Language Model

Chanhee Park; Heuiseok Lim; Hyeonseok Moon; Jinhyeon Kim; Jiwon Moon; JuKyung Jung; Youngjoon Jang; Young-kyoung Ham

arxiv: 2604.25297 · v1 · submitted 2026-04-28 · 💻 cs.CL · cs.AI

Youngjoon Jang, Chanhee Park, Hyeonseok Moon, Young-kyoung Ham, Jiwon Moon, Jinhyeon Kim, JuKyung Jung, Heuiseok Lim

Pith reviewed 2026-05-07 16:41 UTC · model grok-4.3

keywords: legal LLM, Korean language model, domain specialization, use-case-driven datasets, legal AI, training pipeline, professional collaboration

The pith

LegalMidm is a Korean legal LLM built from use-case-driven datasets and professional collaboration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LegalMidm, a Korean legal-domain large language model, and a methodology for building high-quality datasets and training pipelines aligned with actual legal work. It claims that general LLMs lack the precision needed for legal applications because their data and training do not reflect real-world legal requirements, while involving legal experts in curation produces models with better relevance and factual accuracy. A sympathetic reader would care because legal tasks require reliable outputs on nuanced points of law, and this focused approach could make domain LLMs usable in professional settings rather than just experimental ones. The work demonstrates the method's results across key legal tasks.

Core claim

We introduce LegalMidm, a Korean legal-domain LLM, along with a systematic training framework that constructs high-quality, use-case-driven legal datasets through collaboration with legal professionals and rigorous curation, then applies optimized training pipelines to achieve relevance and factual accuracy in practical legal applications.

What carries the argument

The use-case-driven legal dataset construction and optimized training pipelines, which rely on collaboration with legal professionals to align model development with real legal needs.

If this is right

Models trained this way can address nuanced Korean legal queries with higher precision than general-purpose LLMs.
Use-case alignment in datasets reduces common failures in factual grounding and task relevance.
Professional involvement in curation produces training data that mirrors actual legal workflows.
The pipeline can be repeated to create domain specialists for other languages or legal systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dataset-building steps could transfer to non-Korean legal domains or to other high-stakes fields such as medicine.
Integration with existing legal databases might allow the model to cite statutes and precedents directly during inference.
Testing the model in live lawyer workflows would reveal whether the accuracy gains translate to time savings in practice.

Load-bearing premise

That collaboration with legal professionals and rigorous data curation will ensure relevance and factual accuracy sufficient for practical legal use.

What would settle it

A side-by-side evaluation of LegalMidm and a general LLM on a set of real Korean legal cases, scored for accuracy and usefulness by practicing lawyers.
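
A side-by-side comparison of this kind could be scored along the following lines. This is a minimal sketch under stated assumptions, not the paper's evaluation protocol: the function name `paired_comparison`, the per-case scores, and the choice of an exact sign test are all hypothetical and chosen only for illustration.

```python
from math import comb
from statistics import mean


def paired_comparison(scores_a, scores_b):
    """Compare two models that were scored on the same cases.

    scores_a / scores_b: per-case scores (for example, the mean rating
    given by practicing lawyers) for the specialized model and the
    general model, listed in the same case order.
    """
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same non-empty case set")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    n = wins + losses  # ties are dropped, as in the standard sign test

    # Exact two-sided sign test under H0: each model is equally likely to win.
    if n == 0:
        p_value = 1.0
    else:
        k = max(wins, losses)
        tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
        p_value = min(1.0, 2 * tail)

    return {
        "mean_diff": mean(diffs),
        "wins": wins,
        "losses": losses,
        "ties": len(diffs) - n,
        "p_value": p_value,
    }


if __name__ == "__main__":
    # Hypothetical lawyer ratings for five cases; not data from the paper.
    specialized = [8, 7, 9, 6, 8]
    general = [6, 7, 7, 5, 9]
    print(paired_comparison(specialized, general))
```

Pairing the scores by case keeps case difficulty from confounding the comparison; with a realistic number of cases, a rank-based paired test or a bootstrap interval on the mean difference would be reasonable alternatives.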

Figures

Figures reproduced from arXiv: 2604.25297 by Chanhee Park, Heuiseok Lim, Hyeonseok Moon, Jinhyeon Kim, Jiwon Moon, JuKyung Jung, Youngjoon Jang, Young-kyoung Ham.

**Figure 1.** Performance variation with respect to the domain composition of training data. PLM

Original abstract

In recent years, the rapid proliferation of open-source large language models (LLMs) has spurred efforts to turn general-purpose models into domain specialists. However, many domain-specialized LLMs are developed using datasets and training protocols that are not aligned with the nuanced requirements of real-world applications. In the legal domain, where precision and reliability are essential, this lack of consideration limits practical utility. In this study, we propose a systematic training framework grounded in the practical needs of the legal domain, with a focus on Korean law. We introduce LegalMidm, a Korean legal-domain LLM, and present a methodology for constructing high-quality, use-case-driven legal datasets and optimized training pipelines. Our approach emphasizes collaboration with legal professionals and rigorous data curation to ensure relevance and factual accuracy, and demonstrates effectiveness in key legal tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LegalMidm applies standard fine-tuning to Korean legal data via expert curation, but reports no metrics, baselines, or error analysis to support its effectiveness claims.

The paper introduces LegalMidm, a fine-tuned LLM for Korean legal tasks, along with a method to build datasets based on actual legal use cases. The core idea is to work with legal experts to curate data that matches real-world needs rather than just scraping statutes. They do a decent job laying out why generic domain adaptation falls short for law, where accuracy matters a lot. The focus on Korean is practical, since most legal LLMs target English or Chinese. The training pipeline description sounds systematic, with steps for data collection and optimization.

The main weakness is the lack of any results. The abstract claims the approach ensures relevance and factual accuracy and demonstrates effectiveness, but there are no numbers, no comparison to base models, no tests on held-out legal questions, and no analysis of where it still fails. Without that, the claim that professional collaboration plus curation delivers usable accuracy is just an assumption. The stress-test note is right on this point; the paper doesn't provide the measurements that would let us check if the method works.

This is the kind of paper that could interest people building legal tech tools in Korea or similar languages. It might give them a starting point for data curation. But for someone looking for new techniques or solid evidence of improvement, it doesn't deliver much. It reduces to applying known fine-tuning techniques to a new setting without showing gains.

I'd recommend sending it for peer review. The authors can add the missing evaluations during revision, and the local focus makes it worth getting feedback from experts in Korean NLP or legal informatics.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces LegalMidm, a Korean legal-domain LLM, and a methodology for constructing high-quality, use-case-driven legal datasets and optimized training pipelines. It emphasizes collaboration with legal professionals and rigorous data curation to ensure relevance and factual accuracy, claiming to demonstrate effectiveness in key legal tasks.

Significance. If the empirical claims are substantiated, the work could meaningfully advance domain-specific LLM specialization for legal applications, especially in non-English jurisdictions like Korean law, by prioritizing practical use cases over generic adaptation. The focus on professional collaboration and curation is a constructive direction for improving factual reliability in high-stakes domains.

major comments (2)

Abstract: The central claim that the approach 'demonstrates effectiveness in key legal tasks' is unsupported by any metrics, baselines, error analysis, ablation studies, or quantitative results. This absence renders the effectiveness assertion unevaluable and is load-bearing for the paper's contribution.
Abstract: No specifics are provided on dataset construction details, such as curation error rates, inter-expert agreement on statute interpretations, coverage of edge-case precedents, or post-training hallucination rates on held-out legal queries. These measurements are required to substantiate the assumption that professional collaboration plus curation yields practical factual accuracy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We agree that the current manuscript would benefit from stronger empirical grounding and additional methodological details. Below we respond to each major comment and indicate the revisions we will make.

Referee: Abstract: The central claim that the approach 'demonstrates effectiveness in key legal tasks' is unsupported by any metrics, baselines, error analysis, ablation studies, or quantitative results. This absence renders the effectiveness assertion unevaluable and is load-bearing for the paper's contribution.

Authors: We acknowledge that the abstract's phrasing implies quantitative support that is not present in the current manuscript. The work centers on the use-case-driven dataset construction and training methodology rather than on benchmark results. In the revised version we will qualify the abstract claim to focus on the framework's design and will add a dedicated evaluation section containing task-specific metrics, baseline comparisons, error analysis, and any feasible ablation studies on Korean legal tasks.
revision: yes

Referee: Abstract: No specifics are provided on dataset construction details, such as curation error rates, inter-expert agreement on statute interpretations, coverage of edge-case precedents, or post-training hallucination rates on held-out legal queries. These measurements are required to substantiate the assumption that professional collaboration plus curation yields practical factual accuracy.

Authors: We agree that greater transparency on curation quality metrics is needed. In the revision we will expand the dataset construction section to report inter-expert agreement statistics (e.g., agreement rates obtained during statute interpretation sessions with legal professionals), quantitative coverage of edge-case precedents derived from our use-case analysis, and observed curation error rates. We will also include post-training evaluation results on held-out legal queries that measure hallucination rates. These additions will directly address the request for evidence supporting the benefits of expert collaboration and curation.
revision: yes
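
One common way to report inter-expert agreement is Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. The sketch below is an illustration only: the rebuttal mentions agreement rates and does not say which statistic will be used, and the function name `cohens_kappa`, the label names, and the sample annotations are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same non-empty item set")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if each annotator labeled independently
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical judgments by two legal experts on four curated examples.
    expert_1 = ["correct", "correct", "incorrect", "correct"]
    expert_2 = ["correct", "incorrect", "incorrect", "correct"]
    print(cohens_kappa(expert_1, expert_2))
```

Kappa as written covers exactly two annotators; with more than two experts per item, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual choice.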

Circularity Check

0 steps flagged

No significant circularity; applied methodology paper with no derivations or fitted predictions

The paper describes construction of LegalMidm via use-case-driven datasets, collaboration with legal professionals, and training pipelines. No equations, parameter fits, predictions, or first-principles results appear that could reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims rest on described processes rather than tautological redefinitions or renamings of known results. This matches the default expectation for non-circular applied model-building work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; it introduces no explicit free parameters, mathematical axioms, or new physical entities. The work implicitly relies on standard LLM fine-tuning assumptions and on the premise that expert-curated data yields factual accuracy.

