LegalMidm: Use-Case-Driven Legal Domain Specialization for Korean Large Language Model
Pith reviewed 2026-05-07 16:41 UTC · model grok-4.3
The pith
LegalMidm is a Korean legal LLM built from use-case-driven datasets and professional collaboration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce LegalMidm, a Korean legal-domain LLM, along with a systematic training framework that constructs high-quality, use-case-driven legal datasets through collaboration with legal professionals and rigorous curation, then applies optimized training pipelines to achieve relevance and factual accuracy in practical legal applications.
What carries the argument
The use-case-driven legal dataset construction and optimized training pipelines, which rely on collaboration with legal professionals to align model development with real legal needs.
If this is right
- Models trained this way can address nuanced Korean legal queries with higher precision than general-purpose LLMs.
- Use-case alignment in datasets reduces common failures in factual grounding and task relevance.
- Professional involvement in curation produces training data that mirrors actual legal workflows.
- The pipeline can be repeated to create domain specialists for other languages or legal systems.
Where Pith is reading between the lines
- The same dataset-building steps could transfer to non-Korean legal domains or to other high-stakes fields such as medicine.
- Integration with existing legal databases might allow the model to cite statutes and precedents directly during inference.
- Testing the model in live lawyer workflows would reveal whether the accuracy gains translate to time savings in practice.
Load-bearing premise
That collaboration with legal professionals and rigorous data curation will ensure relevance and factual accuracy sufficient for practical legal use.
What would settle it
A side-by-side evaluation of LegalMidm and a general LLM on a set of real Korean legal cases, scored for accuracy and usefulness by practicing lawyers.
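One way such a side-by-side comparison could be tallied is sketched below. It is an illustration only: the function name `paired_comparison`, the per-case score lists, and the choice of an exact sign test are assumptions, not a protocol described in the paper.

```python
from math import comb


def paired_comparison(scores_a, scores_b):
    """Compare two models scored by lawyers on the same legal cases.

    scores_a / scores_b are hypothetical per-case scores (higher is better)
    for, e.g., a legal-domain model and a general-purpose model.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same cases")

    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses  # ties carry no information in a sign test

    # Two-sided exact sign test under H0: each model is equally likely to win.
    if n == 0:
        p_value = 1.0
    else:
        k = min(wins, losses)
        tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
        p_value = min(1.0, 2 * tail)

    return {
        "wins": wins,
        "losses": losses,
        "ties": len(scores_a) - n,
        "p_value": p_value,
    }


if __name__ == "__main__":
    # Made-up scores for five cases, purely to show the call shape.
    print(paired_comparison([8, 7, 9, 6, 8], [6, 7, 5, 5, 7]))
```

A sign test only uses which model won each case, so it stays valid even if different lawyers use the score scale differently; it says nothing about how large the gap is.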
Original abstract
In recent years, the rapid proliferation of open-source large language models (LLMs) has spurred efforts to turn general-purpose models into domain specialists. However, many domain-specialized LLMs are developed using datasets and training protocols that are not aligned with the nuanced requirements of real-world applications. In the legal domain, where precision and reliability are essential, this lack of consideration limits practical utility. In this study, we propose a systematic training framework grounded in the practical needs of the legal domain, with a focus on Korean law. We introduce LegalMidm, a Korean legal-domain LLM, and present a methodology for constructing high-quality, use-case-driven legal datasets and optimized training pipelines. Our approach emphasizes collaboration with legal professionals and rigorous data curation to ensure relevance and factual accuracy, and demonstrates effectiveness in key legal tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LegalMidm, a Korean legal-domain LLM, and a methodology for constructing high-quality, use-case-driven legal datasets and optimized training pipelines. It emphasizes collaboration with legal professionals and rigorous data curation to ensure relevance and factual accuracy, claiming to demonstrate effectiveness in key legal tasks.
Significance. If the empirical claims are substantiated, the work could meaningfully advance domain-specific LLM specialization for legal applications, especially in non-English jurisdictions like Korean law, by prioritizing practical use cases over generic adaptation. The focus on professional collaboration and curation is a constructive direction for improving factual reliability in high-stakes domains.
major comments (2)
- Abstract: The central claim that the approach 'demonstrates effectiveness in key legal tasks' is unsupported by any metrics, baselines, error analysis, ablation studies, or quantitative results. This absence renders the effectiveness assertion unevaluable and is load-bearing for the paper's contribution.
- Abstract: No specifics are provided on dataset construction details, such as curation error rates, inter-expert agreement on statute interpretations, coverage of edge-case precedents, or post-training hallucination rates on held-out legal queries. These measurements are required to substantiate the assumption that professional collaboration plus curation yields practical factual accuracy.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We agree that the current manuscript would benefit from stronger empirical grounding and additional methodological details. Below we respond to each major comment and indicate the revisions we will make.
Point-by-point responses
- Referee: Abstract: The central claim that the approach 'demonstrates effectiveness in key legal tasks' is unsupported by any metrics, baselines, error analysis, ablation studies, or quantitative results. This absence renders the effectiveness assertion unevaluable and is load-bearing for the paper's contribution.
  Authors: We acknowledge that the abstract's phrasing implies quantitative support that is not present in the current manuscript. The work centers on the use-case-driven dataset construction and training methodology rather than on benchmark results. In the revised version we will qualify the abstract claim to focus on the framework's design and will add a dedicated evaluation section containing task-specific metrics, baseline comparisons, error analysis, and any feasible ablation studies on Korean legal tasks.
  Revision: yes
- Referee: Abstract: No specifics are provided on dataset construction details, such as curation error rates, inter-expert agreement on statute interpretations, coverage of edge-case precedents, or post-training hallucination rates on held-out legal queries. These measurements are required to substantiate the assumption that professional collaboration plus curation yields practical factual accuracy.
  Authors: We agree that greater transparency on curation quality metrics is needed. In the revision we will expand the dataset construction section to report inter-expert agreement statistics (e.g., agreement rates obtained during statute interpretation sessions with legal professionals), quantitative coverage of edge-case precedents derived from our use-case analysis, and observed curation error rates. We will also include post-training evaluation results on held-out legal queries that measure hallucination rates. These additions will directly address the request for evidence supporting the benefits of expert collaboration and curation.
  Revision: yes
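As a hedged illustration of what an inter-expert agreement statistic might look like, the sketch below computes Cohen's kappa for two annotators. The choice of kappa, the function name `cohens_kappa`, and the example labels are assumptions; the rebuttal does not say which statistic the authors would report.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    labels_a / labels_b are hypothetical per-item labels from two legal
    experts, e.g. whether a curated answer interprets a statute correctly.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    # Fraction of items on which the two experts gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each expert's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both experts used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    expert_1 = ["ok", "ok", "err", "err"]
    expert_2 = ["ok", "ok", "err", "ok"]
    print(cohens_kappa(expert_1, expert_2))
```

Kappa corrects raw agreement for chance, which matters when one label dominates: two experts who both mark nearly everything "ok" can agree often without that agreement being informative.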
Circularity Check
No significant circularity; applied methodology paper with no derivations or fitted predictions
The paper describes construction of LegalMidm via use-case-driven datasets, collaboration with legal professionals, and training pipelines. No equations, parameter fits, predictions, or first-principles results appear that could reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims rest on described processes rather than tautological redefinitions or renamings of known results. This matches the default expectation for non-circular applied model-building work.
Reference graph
Works this paper leans on
- [1] URL https://arxiv.org/abs/2505.09666. ChatKoAlpaca Community. KoAlpaca-RealQA: A Korean instruction dataset reflecting real user scenarios. https://huggingface.co/datasets/beomi/KoAlpaca-RealQA, 2024. Badhan Chandra Das, M. Hadi Amini, and Yanzhao Wu. Security and privacy challenges of large language models: A survey. CoRR, abs/2402.00888, 2024. URL https://...
- [2] Query: The user's request for writing a complaint
- [3] AI Response: The AI assistant's complaint to be evaluated
- [4] Gold Answer: The ideal complaint to the `Query`. ## Evaluation Criteria (each scored 1–10): Evaluate the `AI Response` based on the following criteria, assigning an **integer** score from 1 to 10 for each:
- [5] Factual Clarity: Are the underlying facts in `AI Response` presented clearly, free from contradiction?
- [6] Legal Foundation: Does the `AI Response` appropriately identify causes of action, cite statutes, or rely on relevant legal provisions?
- [7] Logical Structure: Is the `AI Response` logically structured and easy to follow, presenting claims in a coherent manner?
- [8] Completeness: Does `AI Response` address all essential elements of a formal complaint (parties, facts, claims, relief sought) and avoid irrelevant information? ## Scoring Guide: • 1–2: Extremely disorganized; missing core elements, inaccurate facts or law. • 3–4: Some relevant sections covered but with notable errors or omissions. • 5–6: Sufficient clarity...
- [10] AI Response: The AI assistant's petition to be evaluated
- [11] Gold Answer: The ideal petition to the `Query`. ## Evaluation Criteria (each scored 1–10):
- [12] Factual Representation: Does the `AI Response` accurately and truthfully represent the circumstances?
- [13] Persuasiveness: Is the `AI Response` appropriately persuasive, balancing sincerity with formality?
- [14] Legal Relevance: Does the `AI Response` cite or reference legal principles correctly, and is it framed in a manner consistent with a formal legal request?
- [15] Completeness: Does the `AI Response` address all essential details required in the `Gold Answer`? ## Scoring Guide: • 1–2: Contains serious factual inaccuracies, insufficient detail, or inappropriate content. • 3–4: Some relevant parts included, but major omissions or inaccuracies persist. • 5–6: Adequate factual correctness, modest persuasiveness; partial...
- [16] Query: The user's request and context for summarization
- [17] AI Response: The AI assistant's summary to be evaluated
- [18] Gold Answer: The ideal summary to the `Query`. ## Evaluation Criteria (each scored 1–10):
- [19] Accuracy: Does the `AI Response` capture all critical legal facts and key points from the `Gold Answer`?
- [20] Clarity: Is the `AI Response` well-structured, to-the-point, and free of extraneous detail?
- [21] Objectivity: Does the `AI Response` remain neutral, reflecting the original content of `Query` without bias?
- [22] Relevance: Does every piece of information in the `AI Response` directly relate to the original case in `Query`? ## Scoring Guide: • 1–2: Significant distortion or omission of main points. • 3–4: Includes some key points, but lacks clarity or correctness. • 5–6: Adequate level of detail; some minor issues in clarity or completeness. • 7–8: Overall high-qua...
- [23] Query: The user's legal question
- [26] Accuracy: Does the `AI Response` align with the `Gold Answer` and is it legally correct?
- [27] Depth: Does the `AI Response` provide sufficient detail, exploring necessary angles of the legal scenario?
- [28] Clarity: Is the `AI Response` clear and logically structured?
- [29] Legal Concepts: Does the `AI Response` appropriately use and explain relevant legal doctrines, statutes, or principles? ## Scoring Guide: • 1–2: The response is severely incorrect or off-topic, demonstrating minimal legal understanding. • 3–4: Partially correct but lacks important details, clarity, or sound reasoning regarding legal concepts. • 5–6: Modera...
- [30] Query: The user's question and legal passage
- [31] AI Response: The AI assistant's response to be evaluated
- [32] Gold Answer: The ideal response to the `Query`. ## Evaluation Criteria (each scored 1–10):
- [33] Correctness: Does the `AI Response` accurately match the `Gold Answer`, especially regarding legal facts and conclusions?
- [34] Completeness: Does the `AI Response` address all parts of the `Query`, covering key details?
- [35] Clarity: Is the `AI Response` written in a clear, unambiguous manner?
- [36] Consistency with Provided Passage: Does the `AI Response` rely on or align with the source passage in the `Query`, avoiding hallucinations? ## Scoring Guide: • 1–2: Major factual errors, missing essential information, or irrelevance. • 3–4: Partially correct but lacking thoroughness or clarity. • 5–6: Reasonably accurate, with only minor gaps or ambiguitie...
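The rubric fragments above ask a judge for an integer score from 1 to 10 on each of four criteria. A minimal sketch of how such scores could be validated and combined follows. The function name `aggregate_rubric`, the dictionary input, and the use of an unweighted mean are all assumptions; the excerpts do not show the judge's output format or how per-criterion scores are aggregated.

```python
def aggregate_rubric(scores, criteria):
    """Validate per-criterion judge scores and return their unweighted mean.

    scores   : dict mapping criterion name -> integer score (hypothetical format)
    criteria : list of criterion names the rubric requires
    """
    missing = [c for c in criteria if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")

    for name in criteria:
        value = scores[name]
        # The rubric asks for integers from 1 to 10; bool is excluded
        # explicitly because Python treats it as a subclass of int.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name}: score must be an integer")
        if not 1 <= value <= 10:
            raise ValueError(f"{name}: score must be between 1 and 10")

    # Assumption: equal weight per criterion.
    return sum(scores[name] for name in criteria) / len(criteria)


if __name__ == "__main__":
    criteria = [
        "Correctness",
        "Completeness",
        "Clarity",
        "Consistency with Provided Passage",
    ]
    judged = {
        "Correctness": 8,
        "Completeness": 7,
        "Clarity": 9,
        "Consistency with Provided Passage": 8,
    }
    print(aggregate_rubric(judged, criteria))
```

Rejecting out-of-range or non-integer scores before averaging keeps a malformed judge response from silently skewing a task-level result.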