
arxiv: 2604.17153 · v1 · submitted 2026-04-18 · 💻 cs.CL · cs.AI

From Legal Text to Executable Decision Models: Evaluating Structured Representations for Legal Decision Model Generation

Pith reviewed 2026-05-10 06:05 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords legal decision models · large language models · structured representations · input output constraints · semantic role labels · executable logic · graph similarity · functional equivalence

The pith

Input and output constraints substantially improve large language model generation of executable decision models from legal text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Legal texts must be converted into precise executable logic for applications like checking environmental permits, but this conversion has long required heavy manual work. The paper tests whether adding intermediate structured information to the input text helps LLMs produce better decision models. Using a dataset of 95 real production models paired with Dutch environmental planning law texts, it compares raw text against versions augmented with semantic role labels, input/output constraints, or both. Input/output constraints deliver the largest gains, raising structural similarity to gold models by 37-54 percent, while semantic role labels add only modest improvement. Generated models agree with gold standards on 51-53 percent of test scenarios yet tend to be smaller because they drop redundant logic that accounts for up to 55 percent of nodes in the originals.

Core claim

Enriching legal text with input and output constraints produces decision models whose graph structure matches gold-standard models 37-54 percent better than a raw-text baseline. Semantic role labels yield smaller gains. When the generated models are executed on pre-configured test scenarios, they produce the same outcomes as the gold models in 51-53 percent of cases. The generated models are typically smaller and simpler, removing redundant pass-through logic that comprises 45-55 percent of nodes in the expert models. Structural similarity and functional equivalence are complementary measures: high similarity does not guarantee matching outcomes, and vice versa.
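
To make the "redundant pass-through logic" concrete, here is a minimal sketch of collapsing nodes that only forward a single input. The graph encoding, the node names, and the explicit `pass_through` set are assumptions made for illustration; the paper's actual model format and its way of identifying pass-through nodes are not described in this summary.

```python
from typing import Dict, List, Set

# Hypothetical toy encoding: node name -> list of the nodes it reads from.
Graph = Dict[str, List[str]]


def collapse_pass_through(inputs: Graph, pass_through: Set[str]) -> Graph:
    """Remove nodes that only forward a single input, rewiring their consumers."""

    def resolve(node: str) -> str:
        # Follow chains of pass-through nodes down to the first real node.
        while node in pass_through:
            (node,) = inputs[node]  # a pass-through node has exactly one input
        return node

    return {
        node: [resolve(src) for src in sources]
        for node, sources in inputs.items()
        if node not in pass_through
    }


# Invented example model, not taken from the dataset.
gold = {
    "q_noise": [],            # input question
    "d_noise": ["q_noise"],   # real decision
    "relay_1": ["d_noise"],   # forwards d_noise unchanged
    "relay_2": ["relay_1"],   # forwards relay_1 unchanged
    "outcome": ["relay_2"],   # final regulatory outcome
}
slim = collapse_pass_through(gold, pass_through={"relay_1", "relay_2"})
removed_share = 1 - len(slim) / len(gold)
print(slim)
print(f"{removed_share:.0%} of nodes removed")
```

In this toy case the outcome node ends up reading directly from the real decision, so the collapsed model is structurally different from the original while computing the same result. That is one way a smaller generated model can score lower on structural similarity without losing functional agreement.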

What carries the argument

A controlled comparison of four LLM input conditions (raw legal text, text plus semantic role labels, text plus input/output constraints, and text plus both) measured by graph-kernel similarity to gold models and by functional equivalence after execution on test scenarios.
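
As a rough illustration of what a graph-kernel similarity score measures, the sketch below compares two small graphs by their histograms of shortest-path lengths and normalizes the result to the range 0 to 1. This is a simplified, unlabeled stand-in chosen for brevity; which kernels the paper uses and how it normalizes them is not stated in this summary, and the example graphs are invented.

```python
from collections import Counter, deque
from math import sqrt
from typing import Dict, List

Adjacency = Dict[str, List[str]]  # node -> nodes it feeds into (assumed encoding)


def shortest_path_histogram(adj: Adjacency) -> Counter:
    """Count ordered node pairs by shortest-path length, using BFS from each node."""
    hist: Counter = Counter()
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj.get(node, []):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        hist.update(d for d in dist.values() if d > 0)
    return hist


def kernel(a: Counter, b: Counter) -> float:
    """Dot product of two path-length histograms."""
    return float(sum(a[k] * b[k] for k in a.keys() & b.keys()))


def normalized_similarity(g1: Adjacency, g2: Adjacency) -> float:
    """Cosine-style normalization so that identical graphs score 1.0."""
    h1, h2 = shortest_path_histogram(g1), shortest_path_histogram(g2)
    denom = sqrt(kernel(h1, h1) * kernel(h2, h2))
    return kernel(h1, h2) / denom if denom else 0.0


# Invented graphs: a gold model with one relay node and a generated model without it.
gold = {"q": ["d1"], "d1": ["relay"], "relay": ["out"], "out": []}
generated = {"q": ["d1"], "d1": ["out"], "out": []}

print(normalized_similarity(gold, gold))
print(normalized_similarity(gold, generated))
```

A score like this is purely structural: it never executes either model, which is why the paper pairs it with outcome evaluation on test scenarios.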

If this is right

  • Input and output constraints should be added to prompts when using LLMs to generate legal decision models.
  • LLMs can automatically simplify overly complex decision models by eliminating redundant pass-through logic.
  • Both structural similarity and outcome-based testing are required for validation because the two measures do not always agree.
  • The approach can be applied to other legal domains that have paired text and production decision models.
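
The four input conditions can be pictured as one prompt template with optional sections. The sketch below is only an illustration of that idea: the prompt wording, the section headers, the function name, and the example legal sentence are all invented, and the paper's real prompts may look quite different.

```python
from typing import Dict, List, Optional


def build_prompt(
    legal_text: str,
    semantic_roles: Optional[Dict[str, str]] = None,
    allowed_inputs: Optional[List[str]] = None,
    required_outputs: Optional[List[str]] = None,
) -> str:
    """Assemble one prompt; optional sections are appended only when supplied."""
    parts = [
        "Generate an executable decision model for the legal text below.",
        "",
        "LEGAL TEXT:",
        legal_text,
    ]
    if semantic_roles:
        parts += ["", "SEMANTIC ROLES:"]
        parts += [f"- {role}: {span}" for role, span in semantic_roles.items()]
    if allowed_inputs:
        parts += ["", "ALLOWED INPUTS:"] + [f"- {name}" for name in allowed_inputs]
    if required_outputs:
        parts += ["", "REQUIRED OUTPUTS:"] + [f"- {name}" for name in required_outputs]
    return "\n".join(parts)


# Placeholder content, invented for illustration only.
text = "An alarm installation must be reported to the municipality."
roles = {"actor": "owner of the installation", "action": "report"}
inputs = ["has_alarm_installation"]
outputs = ["reporting_obligation"]

conditions = {
    "raw": build_prompt(text),
    "text+srl": build_prompt(text, semantic_roles=roles),
    "text+io": build_prompt(text, allowed_inputs=inputs, required_outputs=outputs),
    "text+srl+io": build_prompt(text, roles, inputs, outputs),
}
print(conditions["text+srl+io"])
```

Keeping the legal text identical across all four variants is what makes the comparison controlled: any difference in the generated models can then be attributed to the added sections.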

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Emphasizing constraint specification over full semantic parsing may be the higher-leverage step in building legal automation pipelines.
  • The roughly 50 percent match rate on scenarios implies that human review remains necessary before deploying generated models in production.
  • The public dataset release allows direct testing of whether further prompt engineering or model scaling can close the remaining gap to gold-standard performance.
  • The same structured-representation technique could be tested on regulatory texts from other countries or sectors once comparable paired data becomes available.

Load-bearing premise

The pre-configured test scenarios adequately cover the decision space of the legal rules so that agreement on those scenarios indicates the generated models are correct for practical use.

What would settle it

Running the generated models on a fresh collection of test scenarios outside the original pre-configured set and observing frequent outcome mismatches with the gold models.
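
A test of this kind could be run as differential testing: sample fresh scenarios, execute both models, and record where the outcomes diverge. The sketch below assumes the models can be called as plain Python functions over boolean answers; the two toy models, the question names, and the uniform random sampling are all hypothetical and stand in for the real execution engine.

```python
import random
from typing import Callable, Dict, List, Tuple

Scenario = Dict[str, bool]
Model = Callable[[Scenario], str]

QUESTIONS = ["is_wind_turbine", "near_dwellings"]  # hypothetical input questions


def gold_model(s: Scenario) -> str:
    """Toy stand-in for an expert-built model."""
    if s["is_wind_turbine"] and s["near_dwellings"]:
        return "permit_required"
    return "no_permit"


def generated_model(s: Scenario) -> str:
    """Toy stand-in for an LLM-generated model that ignores one condition."""
    return "permit_required" if s["is_wind_turbine"] else "no_permit"


def fresh_scenarios(questions: List[str], n: int, seed: int) -> List[Scenario]:
    """Sample n scenarios with independent, uniformly random boolean answers."""
    rng = random.Random(seed)
    return [{q: rng.random() < 0.5 for q in questions} for _ in range(n)]


def match_rate(
    gold: Model, candidate: Model, scenarios: List[Scenario]
) -> Tuple[float, List[Scenario]]:
    """Return the share of scenarios with equal outcomes, plus the mismatches."""
    mismatches = [s for s in scenarios if gold(s) != candidate(s)]
    return 1 - len(mismatches) / len(scenarios), mismatches


scenarios = fresh_scenarios(QUESTIONS, n=200, seed=0)
rate, mismatches = match_rate(gold_model, generated_model, scenarios)
print(f"match rate on fresh scenarios: {rate:.0%}")
print("example mismatch:", mismatches[0] if mismatches else None)
```

Uniform sampling is the simplest choice but not necessarily a realistic one, since real permit applications are unlikely to be uniformly distributed over the input space.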

Figures

Figures reproduced from arXiv: 2604.17153 by David Graus.

Figure 1: The Outcome - GeluidProdWindturbine decision model. Green circles represent input nodes (boolean questions), blue squares are decision nodes, and the yellow square produces the final regulatory outcome. As an example, …

Figure 2: GeluidProdWindturbine: High structural similarity

Figure 3: AlarminstallatieHebben: Low structural similarity

Original abstract

Transforming legal text into executable decision logic is a longstanding challenge in legal informatics. With the rise of LLMs, this task has gained renewed interest, but remains challenging due to requiring extensive manual coding and evaluation. We use a unique real-world dataset that pairs production-grade decision models with legal text from the Dutch Environment and Planning Act. These models power the Omgevingsloket government platform, where citizens check permit requirements for environmental activities. We study whether intermediate structured representations can improve LLM-based generation of executable decision models from legal text. We compare four input conditions: raw legal text, text enriched with semantic role labels, text enriched with input and output constraints, and text enriched with both. We evaluate along two dimensions: structural evaluation, through similarity to gold decision models with graph kernels and graphs' descriptive statistics, and outcome evaluation, through functional equivalence by executing models on pre-configured test scenarios. Our findings show that I/O constraints provide the dominant improvement (+37-54% similarity over baseline), while semantic role labels show modest improvements. Outcome evaluation shows that generated models match the gold standard on 51-53% of test scenarios, even though generated models are typically smaller and simpler. We find LLMs eliminate redundant pass-through logic that comprises up to 45-55% of nodes. Importantly, structural similarity and outcome equivalence are complementary: structural similarity does not guarantee outcome equivalence, and vice versa. To facilitate reproducibility, we publicly release our dataset of 95 production decision models with associated legal text and all experimental code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript evaluates whether intermediate structured representations (semantic role labels and/or input/output constraints) improve LLM-based generation of executable decision models from legal text. Using a dataset of 95 production-grade models paired with Dutch Environment and Planning Act text, it compares four input conditions via structural similarity (graph kernels and descriptive statistics) and outcome equivalence (execution on pre-configured test scenarios). Key results: I/O constraints yield +37-54% structural similarity gains over baseline while semantic roles show modest gains; generated models achieve 51-53% scenario match rate despite being smaller and simpler (removing 45-55% pass-through nodes); structural and outcome metrics are complementary. The dataset and code are released for reproducibility.

Significance. If the evaluation is robust, the work provides actionable evidence on effective input structuring for legal decision model generation, with I/O constraints as the dominant factor. The public release of the unique real-world dataset of 95 models with associated legal text and all experimental code is a clear strength for reproducibility and future research in legal informatics and LLM applications to structured reasoning.

major comments (1)
  1. [Outcome evaluation] Outcome evaluation (as described in the abstract and results): The central claim that generated models demonstrate usable correctness via 51-53% functional equivalence on test scenarios is load-bearing, yet the paper reports no path-enumeration, coverage metric, or verification that the pre-configured scenarios exercise all decision paths in the 95 gold models. Given that generated models systematically remove 45-55% of nodes as redundant pass-through logic, this leaves open whether the match rate reflects true equivalence or incomplete scenario coverage.
minor comments (2)
  1. [Abstract] The abstract states that structural similarity and outcome equivalence are complementary but does not provide a concrete example or cross-tabulation showing cases where one holds and the other does not.
  2. [Methods] Clarify the exact number of test scenarios used and their derivation process, as this is referenced but not detailed in the provided summary of methods.
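
The coverage check requested in the major comment could, for models with few boolean inputs, be approximated by brute-force path enumeration. The sketch below is a hypothetical illustration: it assumes a model can report which branch it took, and the toy model, question names, and scenarios are invented. Enumerating all 2^n input combinations is only feasible for small n.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

Scenario = Dict[str, bool]
Path = Tuple[str, ...]

QUESTIONS = ["is_wind_turbine", "near_dwellings", "night_operation"]  # hypothetical


def traced_gold_model(s: Scenario) -> Tuple[str, Path]:
    """Toy gold model that returns its outcome and the branch it followed."""
    if not s["is_wind_turbine"]:
        return "no_permit", ("not_turbine",)
    if s["near_dwellings"]:
        return "permit_required", ("turbine", "near")
    if s["night_operation"]:
        return "notification", ("turbine", "far", "night")
    return "no_permit", ("turbine", "far", "day")


def all_paths(questions: List[str]) -> Set[Path]:
    """Enumerate every input combination and collect the distinct paths taken."""
    paths: Set[Path] = set()
    for values in product([False, True], repeat=len(questions)):
        _, path = traced_gold_model(dict(zip(questions, values)))
        paths.add(path)
    return paths


def path_coverage(scenarios: List[Scenario]) -> Tuple[float, Set[Path]]:
    """Share of reachable paths hit by the scenarios, plus the paths never hit."""
    reachable = all_paths(QUESTIONS)
    exercised = {traced_gold_model(s)[1] for s in scenarios}
    return len(exercised & reachable) / len(reachable), reachable - exercised


# Two invented pre-configured scenarios.
scenarios = [
    {"is_wind_turbine": False, "near_dwellings": False, "night_operation": False},
    {"is_wind_turbine": True, "near_dwellings": True, "night_operation": False},
]
coverage, missing = path_coverage(scenarios)
print(f"path coverage: {coverage:.0%}")
print("unexercised paths:", sorted(missing))
```

In this toy case the two scenarios exercise only half of the reachable paths, so a candidate model that mishandles the "far" branches would still match on every provided scenario.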

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. We address the single major comment on outcome evaluation below and propose targeted revisions to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: Outcome evaluation (as described in the abstract and results): The central claim that generated models demonstrate usable correctness via 51-53% functional equivalence on test scenarios is load-bearing, yet the paper reports no path-enumeration, coverage metric, or verification that the pre-configured scenarios exercise all decision paths in the 95 gold models. Given that generated models systematically remove 45-55% of nodes as redundant pass-through logic, this leaves open whether the match rate reflects true equivalence or incomplete scenario coverage.

    Authors: We appreciate the referee highlighting this important methodological consideration. The pre-configured test scenarios originate from the production Omgevingsloket platform and were developed by domain experts to validate the 95 gold-standard models against real-world applications of the Dutch Environment and Planning Act. They therefore represent the functional behaviors that matter in practice rather than an exhaustive enumeration of all theoretically possible input combinations. We did not perform additional path-enumeration or coverage analysis because our primary goal was to compare the four input conditions (raw text, semantic roles, I/O constraints, and both) rather than to certify complete logical equivalence. The reported 51-53% scenario match rate is presented as evidence of functional agreement on these production-relevant cases, not as proof of equivalence across every possible path. We explicitly note in the manuscript that generated models are smaller and remove redundant pass-through logic (45-55% of nodes), and we already emphasize that structural similarity and outcome equivalence are complementary metrics precisely because one does not guarantee the other.

    To address the referee's concern directly, we will revise the manuscript to: (1) add a brief description of the provenance and intended coverage of the test scenarios, (2) qualify the outcome results as agreement on the provided scenarios rather than full equivalence, and (3) expand the limitations discussion to acknowledge the absence of exhaustive path coverage. These changes will clarify the scope of our claims without altering the core findings or requiring new experiments.

    revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation against external gold models

Full rationale

The paper reports an experimental comparison of four input conditions (raw text, SRL-enriched, I/O-constraint-enriched, and combined) for LLM-based generation of decision models. Evaluation uses direct structural similarity metrics (graph kernels and descriptive statistics) and functional equivalence on pre-configured test scenarios against an external dataset of 95 production-grade gold models. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. The central results (I/O constraints yielding +37-54% similarity; 51-53% scenario match) rest on measurements to independent gold data, satisfying the criteria for a self-contained empirical study with no reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

This is an applied empirical study in legal informatics relying on standard assumptions in machine learning evaluation and decision modeling rather than new theoretical constructs.

axioms (3)
  • domain assumption: The gold decision models accurately represent the legal requirements.
    Used as baseline for similarity and equivalence checks.
  • domain assumption: Graph kernels and descriptive statistics are appropriate measures for comparing decision model structures.
    Basis for structural evaluation.
  • domain assumption: Executing models on pre-configured test scenarios reliably indicates functional correctness.
    Basis for outcome evaluation.


