Legal2LogicICL: Improving Generalization in Transforming Legal Cases to Logical Formulas via Diverse Few-Shot Learning
Pith reviewed 2026-05-10 15:06 UTC · model grok-4.3
The pith
A retrieval strategy that balances semantic and structural similarity in few-shot examples lets LLMs convert legal cases into logical formulas with higher accuracy and stability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Legal2LogicICL is a few-shot retrieval framework for in-context learning that selects exemplars by balancing diversity and similarity at both the latent semantic level and the legal text structure level, while explicitly mitigating entity-induced retrieval bias, in which lengthy entity mentions dominate semantic representations and hide reasoning patterns. This produces robust demonstrations that yield accurate, stable logical rule generation from legal cases without any additional training. The authors also release the Legal2Proleg dataset, which aligns legal cases with PROLEG logical formulas to support evaluation. Results on both open-source and proprietary LLMs confirm significant improvements in accuracy, stability, and generalization.
What carries the argument
Legal2LogicICL, the few-shot retrieval framework that selects exemplars by balancing semantic and legal-structure similarity while correcting for entity bias in legal texts.
If this is right
- LLMs generate logical formulas from legal texts with measurably higher accuracy and stability.
- The conversion step generalizes better to new cases without fine-tuning or extra labeled data.
- The same gains appear on both open-source and proprietary language models.
- Logic-based legal reasoning pipelines become more practical because the neural component needs less domain-specific training.
- The released Legal2Proleg dataset enables direct benchmarking of future legal semantic parsing approaches.
Where Pith is reading between the lines
- The explicit correction for entity bias could transfer to other text-to-structure tasks where names overshadow underlying patterns.
- Similar retrieval balancing might lower data needs for deploying logic-based AI in regulated fields such as medicine or finance.
- Applying the framework to legal cases from multiple jurisdictions would test its robustness to language and reasoning variations.
- Pairing Legal2LogicICL outputs with symbolic reasoners could support end-to-end automated review of multi-case legal scenarios.
Load-bearing premise
That balancing semantic similarity with legal text structure and mitigating entity-induced retrieval bias will reliably produce more generalizable logical formulas across unseen cases and different LLMs.
What would settle it
Running the method on a fresh held-out collection of legal cases and finding that logical formula accuracy shows no improvement or drops compared to standard similarity-based retrieval.
Original abstract
This work aims to improve the generalization of logic-based legal reasoning systems by integrating recent advances in NLP with legal-domain adaptive few-shot learning techniques using LLMs. Existing logic-based legal reasoning pipelines typically rely on fine-tuned models to map natural-language legal cases into logical formulas before forwarding them to a symbolic reasoner. However, such approaches are heavily constrained by the scarcity of high-quality annotated training data. To address this limitation, we propose a novel LLM-based legal reasoning framework that enables effective in-context learning through retrieval-augmented generation. Specifically, we introduce Legal2LogicICL, a few-shot retrieval framework that balances diversity and similarity of exemplars at both the latent semantic representation level and the legal text structure level. In addition, our method explicitly accounts for legal structure by mitigating entity-induced retrieval bias in legal texts, where lengthy and highly specific entity mentions often dominate semantic representations and obscure legally meaningful reasoning patterns. Our Legal2LogicICL constructs informative and robust few-shot demonstrations, leading to accurate and stable logical rule generation without requiring additional training. In addition, we construct a new dataset, named Legal2Proleg, which is annotated with alignments between legal cases and PROLEG logical formulas to support the evaluation of legal semantic parsing. Experimental results on both open-source and proprietary LLMs demonstrate that our approach significantly improves accuracy, stability, and generalization in transforming natural-language legal case descriptions into logical representations, highlighting its effectiveness for interpretable and reliable legal reasoning. Our code is available at https://github.com/yingjie7/Legal2LogicICL.
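The abstract's retrieval idea can be sketched in a few lines of Python. This is a hypothetical illustration, not the paper's actual method: the entity masker, bag-of-words similarity, structure markers, and the weight `alpha` are all toy stand-ins for the paper's unspecified embeddings, structure features, and tuning.

```python
# Toy sketch of entity-bias-mitigated retrieval scoring: semantic similarity
# is computed on entity-masked text so long, specific names cannot dominate,
# and it is blended with a crude "legal structure" overlap. All components
# here are illustrative assumptions, not the paper's implementation.
import math
import re
from collections import Counter

ENTITY_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b")  # crude NER stand-in


def mask_entities(text: str) -> str:
    """Replace capitalized entity-like spans with a placeholder token."""
    return ENTITY_PATTERN.sub("[ENT]", text)


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words vectors (toy semantic similarity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def structure_signature(text: str) -> set:
    """Toy 'legal structure' features: presence of assumed marker words."""
    markers = {"if", "unless", "shall", "liable", "owes", "agrees"}
    return {t for t in text.lower().split() if t in markers}


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


def retrieval_score(query: str, exemplar: str, alpha: float = 0.5) -> float:
    """Blend entity-masked semantic similarity with structural overlap."""
    sem = bow_cosine(mask_entities(query), mask_entities(exemplar))
    struct = jaccard(structure_signature(query), structure_signature(exemplar))
    return alpha * sem + (1 - alpha) * struct
```

Under this sketch, two cases that differ only in party names score as near-identical, which is the intended effect of masking entities before computing semantic similarity.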
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Legal2LogicICL, a retrieval-augmented in-context learning framework that selects diverse few-shot exemplars for LLMs by balancing latent semantic similarity with legal text structure and explicitly mitigating entity-induced retrieval bias. It presents the new Legal2Proleg dataset of legal cases aligned to PROLEG logical formulas and claims that this approach yields improved accuracy, stability, and generalization in mapping natural-language legal descriptions to logical representations on both open-source and proprietary LLMs, without any fine-tuning.
Significance. If the empirical gains hold under rigorous evaluation, the work offers a practical, training-free method for legal semantic parsing that directly tackles data scarcity in logic-based legal reasoning. The public release of the Legal2Proleg dataset and code is a clear strength that supports reproducibility and follow-on research. The emphasis on legal-specific biases (entity mentions dominating semantics) distinguishes the retrieval strategy from generic ICL methods.
major comments (2)
- [Experimental Results] Experimental section: the central claim of 'significantly improves accuracy, stability, and generalization' is not supported by any reported numerical results, baselines (e.g., random or semantic-only retrieval), variance measures, or statistical tests in the provided text; without these the magnitude and reliability of the gains cannot be assessed.
- [§3] §3 (Legal2LogicICL framework): the precise weighting or selection algorithm that balances semantic similarity, legal structure, and entity-bias mitigation is described only at a high level; a formal definition or pseudocode is needed to determine whether the method is reproducible and whether the balancing is parameter-free or tuned.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a brief statement of dataset size, number of test cases, and LLM models evaluated to give readers an immediate sense of experimental scale.
- [§3] Notation for the retrieval scoring function (semantic + structural + bias terms) should be introduced with an equation rather than prose only.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important areas for improving the clarity and rigor of our presentation. We address each major comment below and will revise the manuscript to incorporate the suggested enhancements.
Point-by-point responses
-
Referee: [Experimental Results] Experimental section: the central claim of 'significantly improves accuracy, stability, and generalization' is not supported by any reported numerical results, baselines (e.g., random or semantic-only retrieval), variance measures, or statistical tests in the provided text; without these the magnitude and reliability of the gains cannot be assessed.
Authors: We agree that the experimental section in the current version does not provide sufficient numerical detail to fully substantiate the claims. In the revised manuscript, we will expand the experimental section to include explicit accuracy metrics (e.g., exact match and semantic similarity scores), direct comparisons against baselines including random retrieval and semantic-only retrieval, variance measures such as standard deviations across multiple runs for stability assessment, and statistical significance tests (e.g., paired t-tests) to evaluate improvements in accuracy, stability, and generalization. These additions will be presented for both open-source and proprietary LLMs, allowing readers to assess the magnitude and reliability of the reported gains. revision: yes
-
Referee: [§3] §3 (Legal2LogicICL framework): the precise weighting or selection algorithm that balances semantic similarity, legal structure, and entity-bias mitigation is described only at a high level; a formal definition or pseudocode is needed to determine whether the method is reproducible and whether the balancing is parameter-free or tuned.
Authors: We acknowledge that Section 3 provides only a high-level description. To ensure reproducibility, we will add a formal definition of the retrieval scoring function as a weighted combination of three components: (1) latent semantic similarity via cosine similarity on sentence embeddings, (2) legal structure similarity computed from parsed elements such as predicates and relations, and (3) an entity-bias mitigation term that normalizes or masks entity mentions. We will also include pseudocode for the overall exemplar selection procedure. The weights are determined on a small held-out validation set rather than being parameter-free; we will report the specific weights used in our experiments and clarify that no LLM fine-tuning is involved. revision: yes
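The selection procedure the rebuttal outlines could take the shape of a greedy, diversity-aware (MMR-style) loop. This is a hypothetical sketch under stated assumptions: `sim` stands in for any pairwise similarity in [0, 1], and `lam` plays the role of the validation-tuned balancing weight; neither reflects the paper's actual algorithm.

```python
# Hypothetical greedy exemplar selection balancing relevance and diversity.
# lam = 1.0 reduces to pure similarity retrieval; lam < 1.0 penalizes
# candidates that are redundant with already-selected exemplars.

def select_exemplars(query, pool, sim, k=3, lam=0.7):
    """Greedily pick k demonstrations by an MMR-style score."""
    selected = []
    candidates = list(pool)
    while candidates and len(selected) < k:
        def mmr(c):
            relevance = sim(query, c)
            redundancy = max((sim(c, s) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam` below 1.0, a second exemplar that nearly duplicates the first is skipped in favor of a less similar but more informative one, which is the diversity behavior the framework is claimed to provide.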
Circularity Check
No circularity; empirical validation of retrieval framework
full rationale
The paper proposes Legal2LogicICL as a retrieval-augmented few-shot method that balances semantic similarity, legal text structure, and entity-bias mitigation, introduces the Legal2Proleg dataset with PROLEG alignments, and reports experimental gains in accuracy/stability/generalization across LLMs. No equations, derivations, or predictions appear; the central claim rests on external empirical measurements rather than any step that reduces by construction to fitted inputs, self-definitions, or self-citation chains. The method is presented as an engineering solution to data scarcity, with public code and dataset enabling independent verification.
Reference graph
Works this paper leans on
- [1] Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, et al. 2024. Phi-4 technical report. arXiv preprint arXiv:2412.08905 (2024)
- [2] Imran Sarwar Bajwa, Mark G. Lee, and Behzad Bordbar. 2011. SBVR Business Rules Generation from Natural Language Specification. In AAAI Spring Symposium: AI for Business Agility. 2–8
- [3] Tom Brown, Benjamin Mann, Nick Ryder, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 1877–1901
- [4]
- [5] Shruti Gaur, Nguyen H. Vo, Kazuaki Kashihara, and Chitta Baral. 2014. Translating simple legal text to formal representations. In JSAI International Symposium on Artificial Intelligence. Springer, 259–273
- [6] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI] https://arxiv.org/abs/2407.21783
- [7] Shivanshu Gupta, Sameer Singh, and Matt Gardner. 2022. Structurally Diverse Sampling for Sample-Efficient Training and Comprehensive Evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 496...
- [8] Nikolaos Lagos, Frederique Segond, Stefania Castellani, and Jacki O'Neill. 2010. Event extraction for legal case building and reasoning. In International Conference on Intelligent Information Processing. Springer, 92–101
- [9] Itay Levy, Ben Bogin, and Jonathan Berant. 2023. Diverse Demonstrations Improve In-context Compositional Generalization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 1401–1422. https://aclanthology.org/2023.acl-long.78
- [10] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F...
- [11] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2022. What Makes Good In-Context Examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, Eneko Agirre, Marianna Apidianaki, and Ivan Vulić (Eds.). Associat...
- [12] L. Thorne McCarty. 2007. Deep semantic interpretations of legal texts. In Proceedings of the 11th International Conference on Artificial Intelligence and Law. 217–224
- [13] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Comput...
- [14] Phuong Minh Nguyen, Truong Dinh Do, and Minh Le Nguyen. 2025. Improving hierarchical semantic parsing with LLMs: Demonstration selection and chain-of-thought prompting via semantic fragment decoding. Knowledge-Based Systems 328 (2025), 114256. doi:10.1016/j.knosys.2025.114256
- [15] María Navas-Loro, Ken Satoh, and Víctor Rodríguez-Doncel. 2018. ContractFrames: bridging the gap between natural language and logics in contract law. In JSAI International Symposium on Artificial Intelligence. Springer, 101–114
- [16] Ha-Thanh Nguyen, Wachara Fungwacharakorn, Fumihito Nishino, and Ken Satoh. A multi-step approach in translating natural language into logical formula. In Legal Knowledge and Information Systems. IOS Press, 103–112
- [17]
- [18] Minh-Phuong Nguyen, Thi-Thu-Trang Nguyen, Vu Tran, Ha-Thanh Nguyen, Le-Minh Nguyen, and Ken Satoh. 2022. Learning to Map the GDPR to Logic Representation on DAPRECO-KB. In Intelligent Information and Database Systems, Ngoc Thanh Nguyen, Tien Khoa Tran, Ualsher Tukayev, Tzung-Pei Hong, Bogdan Trawiński, and Edward Szczerbicki (Eds.). Springer International ...
- [19] Panupong Pasupat, Yuan Zhang, and Kelvin Guu. 2021. Controllable Semantic Parsing via Retrieval Augmentation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, Online and Punta Cana, Domin...
- [20] Nguyen Phuong, Nguyen Thanh, May Zin, and Ken Satoh. 2026. Data Augmented Pipeline for Legal Information Extraction and Reasoning. In Proceedings of the Twentieth International Conference on Artificial Intelligence and Law (ICAIL '25). Association for Computing Machinery, New York, NY, USA, 481–482. doi:10.1145/3769126.3769200
- [21] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022. Learning To Retrieve Prompts for In-Context Learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz (Eds.). Association for Co...
- [22] Ken Satoh. 2023. PROLEG: Practical Legal Reasoning System. Springer Nature Switzerland, Cham, 277–283. https://doi.org/10.1007/978-3-031-35254-6_23
- [23] Ken Satoh, Masahiro Kubota, Yoshiaki Nishigai, and Chiaki Takano. 2009. Translating the Japanese presupposed ultimate fact theory into logic programming. In Legal Knowledge and Information Systems. IOS Press, 162–171
- [24] Richard Shin, Christopher Lin, Sam Thomson, Charles Chen, Subhro Roy, Emmanouil Antonios Platanios, Adam Pauls, Dan Klein, Jason Eisner, and Benjamin Van Durme. 2021. Constrained Language Models Yield Few-Shot Semantic Parsers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang,...
- [25] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9, 86 (2008), 2579–2605. http://jmlr.org/papers/v9/vandermaaten08a.html
- [26] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
- [27]
- [28] May Myo Zin, Ha Thanh Nguyen, Ken Satoh, Saku Sugawara, and Fumihito Nishino. 2023. Improving translation of case descriptions into logical fact formulas using LegalCaseNER. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law. 462–466