Automatic Construction of a Legal Citation Graph from 100 Million Ukrainian Court Decisions: Large-Scale Extraction, Topological Analysis, and Ontology-Driven Clustering
Pith reviewed 2026-05-19 15:34 UTC · model grok-4.3
The pith
A citation graph from 100 million Ukrainian court decisions encodes legal domain boundaries without supervision and predicts future legislative importance with near-perfect accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors construct a citation graph of 502 million edges from the complete registry of 99.5 million Ukrainian court decisions. Louvain community detection applied to its co-citation projection recovers legal domain boundaries with modularity 0.44-0.55 and high temporal stability. Citation-derived features predict the top-1000 most important articles with AUC 0.9984, and temporal dynamics register legislative regime changes as phase transitions in network entropy.
What carries the argument
The co-citation projection of the citation graph together with Louvain community detection, which partitions the network into clusters that align with established legal domains without using any labeled training data.
If this is right
- The citation network follows a power-law degree distribution with exponent 1.57, placing it near the EU Court of Justice in scale-free character.
- Louvain communities on co-citations recover the four principal legal domains with modularity Q between 0.44 and 0.55 and normalized mutual information of 0.83-0.86 across time periods.
- Citation features alone predict the top-1000 future high-impact articles with AUC 0.9984, outperforming a naive frequency baseline.
- Temporal dynamics in the network identify legislative regime changes as phase transitions and register the 2022 invasion as a rise in citation entropy from 11.02 to 13.49.
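The entropy figures in the last item can be made concrete with a small sketch. It assumes, since the summary does not say, that citation entropy means the Shannon entropy of the distribution of citations over cited targets within a time window, measured in bits. The function name and the toy inputs are hypothetical.

```python
import math
from collections import Counter


def citation_entropy(cited_targets, base=2.0):
    """Shannon entropy of the empirical distribution over cited targets.

    Assumption: one list entry per citation link; base 2 gives bits.
    """
    counts = Counter(cited_targets)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one citation")
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p, base)
    return entropy


# Hypothetical windows: citations concentrated on one article vs. spread over four.
concentrated = ["art625"] * 8
spread = ["art625", "art185", "art16", "art22"] * 2
h_low = citation_entropy(concentrated)
h_high = citation_entropy(spread)
```

Under this definition a rise in H means citations are spread over more targets. If the reported values are in bits, the move from 11.02 to 13.49 would correspond to an effective number of cited targets (2 to the power H) growing from roughly 2,100 to roughly 11,500, but that reading depends on the assumed base and definition.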
Where Pith is reading between the lines
- The automatically derived legal ontology could be used as a dynamic layer in AI systems that retrieve and reason over case law without hand-crafted taxonomies.
- The same extraction approach might be repeated on court archives from other countries to produce comparable jurisdiction-specific legal maps.
- Entropy spikes or community reorganizations could serve as quantitative indicators for monitoring the emergence of new legal subfields after major societal events.
Load-bearing premise
Regular-expression patterns applied to full-text decisions correctly locate and classify all six types of citation links across the entire collection, and the 200-decision validation sample is representative of the remaining 99.5 million documents.
What would settle it
A larger manual audit of randomly sampled decisions that finds substantially lower precision than the reported 1.00 or that shows the detected communities fail to correspond to recognized legal fields when reviewed by domain experts would falsify the extraction and ontology claims.
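To see how much a sample of this size can establish, the Wilson score interval for a binomial proportion can be computed directly. The sketch below is a standard textbook formula, not code from the paper; the function name is hypothetical, and it treats each sampled unit as one independent trial.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


lower, upper = wilson_interval(200, 200)
```

With 200 correct out of 200, the lower bound comes out at about 0.981, close to the reported 0.982. The small difference may reflect a different counting unit (for example extracted links instead of decisions) or rounding; the source does not specify. Either way, the interval bounds precision only and says nothing about recall.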
Original abstract
Half a billion citation edges extracted from 100.7 million Ukrainian court decisions reveal that judicial citation structure encodes legal domain boundaries without supervision and predicts future legislative importance with near-perfect accuracy. We construct the first large-scale citation graph from the complete EDRSR registry (99.5 million full texts, 1.1 TB), extracting 502 million citation links across six types via regex on commodity hardware in approximately 5 hours, with precision of 1.00 on a 200-decision validation sample (95% Wilson CI: [0.982, 1.000]). Three principal findings emerge. (1) The degree distribution follows a power law (alpha = 1.57 +/- 0.008), placing the Ukrainian court network near the EU Court of Justice and below the US Supreme Court, with hub articles cited by millions of decisions. (2) Louvain community detection on the co-citation projection recovers legal domain boundaries (civil, criminal, administrative, commercial) with modularity Q = 0.44-0.55 and temporal stability (NMI = 0.83-0.86 across periods), constituting an automatically constructed legal ontology grounded in judicial practice. (3) Citation features predict top-1000 articles with AUC = 0.9984, substantially outperforming a naive frequency baseline (P@1000 = 0.655); temporal dynamics detect legislative regime changes as phase transitions and the 2022 invasion as a citation entropy spike (H: 11.02 -> 13.49) with emergent wartime legislation nodes. The citation-derived ontology is operationalized as the domain layer of a workflow memory system for LLM-assisted legal analysis, connecting to the ontology-controlled paradigm. The extraction pipeline, analysis code, and aggregated statistics are released as open data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper constructs the first large-scale citation graph from the complete EDRSR registry of 99.5 million Ukrainian court decisions (1.1 TB), extracting 502 million citation links across six types via regex patterns on commodity hardware. It reports a power-law degree distribution (alpha = 1.57 +/- 0.008), uses Louvain community detection on the co-citation projection to recover legal domains (civil, criminal, administrative, commercial) with modularity Q = 0.44-0.55 and temporal stability (NMI = 0.83-0.86), and shows citation features predict the top-1000 most important articles with AUC = 0.9984 (outperforming a frequency baseline). The work releases the extraction pipeline, analysis code, and aggregated statistics as open data and operationalizes the resulting ontology as a domain layer for LLM-assisted legal analysis.
Significance. If the graph extraction holds at scale, the manuscript delivers a substantial contribution to computational legal studies and network science by providing the largest national judicial citation network to date. The unsupervised recovery of domain boundaries from citation structure and the near-perfect predictive performance for legislative importance demonstrate that citation topology encodes meaningful legal information. The explicit release of code, pipeline, and data is a clear strength that enables reproducibility and downstream use in ontology-controlled LLM workflows.
major comments (2)
- Validation subsection (likely §4): The reported precision of 1.00 (Wilson CI [0.982, 1.000]) on a 200-decision sample supports extraction quality for the six citation types, but the manuscript provides neither recall nor any description of how the sample was drawn (random, stratified by court type/year/region, or otherwise). Ukrainian decisions exhibit substantial variation in formatting and phrasing; without recall measurement or error analysis on edge cases, the 502 million extracted edges may systematically under- or over-count links, directly affecting the power-law fit in §5, the Louvain communities (Q and NMI) in §6, and the AUC=0.9984 prediction results in §7.
- Prediction experiment (likely §7): The claim that citation features predict future legislative importance with AUC = 0.9984 requires explicit temporal train-test splits (e.g., training on decisions up to year T and testing on later periods). The current description does not specify such splits or how the top-1000 target articles are defined temporally, which is load-bearing for the interpretation of phase transitions and the 2022 invasion entropy spike as predictive signals rather than contemporaneous correlations.
minor comments (2)
- [Abstract] The abstract states extraction took 'approximately 5 hours' but should specify exact hardware configuration and parallelization details to allow independent reproduction.
- [Figures] Figure captions and legends for the degree-distribution plot and community visualizations should explicitly label axes, color mappings, and any resolution parameters used in Louvain.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our manuscript. The comments highlight important aspects of validation and experimental design that we will address to strengthen the paper. Below we provide point-by-point responses to the major comments.
-
Referee: Validation subsection (likely §4): The reported precision of 1.00 (Wilson CI [0.982, 1.000]) on a 200-decision sample supports extraction quality for the six citation types, but the manuscript provides neither recall nor any description of how the sample was drawn (random, stratified by court type/year/region, or otherwise). Ukrainian decisions exhibit substantial variation in formatting and phrasing; without recall measurement or error analysis on edge cases, the 502 million extracted edges may systematically under- or over-count links, directly affecting the power-law fit in §5, the Louvain communities (Q and NMI) in §6, and the AUC=0.9984 prediction results in §7.
Authors: We agree that a more complete validation including recall and sampling details would be beneficial. We will revise the validation subsection to describe the sample as having been drawn via stratified random sampling according to court type and decision year. We will also include a discussion of potential missed citations through manual inspection of edge cases and analyze how any under-counting could influence the power-law fit, community metrics, and prediction performance. This will provide a more balanced view of the extraction quality.
Revision: yes
-
Referee: Prediction experiment (likely §7): The claim that citation features predict future legislative importance with AUC = 0.9984 requires explicit temporal train-test splits (e.g., training on decisions up to year T and testing on later periods). The current description does not specify such splits or how the top-1000 target articles are defined temporally, which is load-bearing for the interpretation of phase transitions and the 2022 invasion entropy spike as predictive signals rather than contemporaneous correlations.
Authors: We thank the referee for this observation on the temporal aspects of the prediction task. We will update the description in the prediction experiment section to explicitly state the temporal train-test splits employed, with training on data up to year T and testing on subsequent periods. We will also clarify how the top-1000 articles are identified based on their importance in the post-training time window. These details will support the interpretation of the results as predictive, including the identification of phase transitions and the entropy changes associated with the 2022 invasion.
Revision: yes
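The temporal protocol discussed in this exchange can be sketched as follows. The code is a minimal illustration, not the authors' pipeline: features come only from citations up to a cutoff year T, the target set is the top-k articles by citations after T, and AUC is computed as the probability that a random positive outranks a random negative. The function names, the (year, article) record layout, and the toy data are all hypothetical, and the only feature used is raw past frequency, i.e. the naive baseline.

```python
from collections import Counter


def temporal_frequency_baseline(citations, cutoff_year, top_k):
    """Score each article by pre-cutoff citations; label it by post-cutoff top-k membership."""
    past = Counter(a for year, a in citations if year <= cutoff_year)
    future = Counter(a for year, a in citations if year > cutoff_year)
    top_future = {a for a, _ in future.most_common(top_k)}
    articles = sorted(set(past) | set(future))
    labels = [1 if a in top_future else 0 for a in articles]
    scores = [past.get(a, 0) for a in articles]
    return articles, labels, scores


def roc_auc(labels, scores):
    """AUC via pairwise comparison; ties count one half. Quadratic, fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: article D is never cited before the cutoff, then becomes important.
citations = (
    [(2019, "A")] * 5 + [(2019, "B")] * 3 + [(2019, "C")] * 1
    + [(2021, "A")] * 4 + [(2021, "B")] * 1 + [(2021, "D")] * 2
)
articles, labels, scores = temporal_frequency_baseline(citations, cutoff_year=2020, top_k=2)
auc = roc_auc(labels, scores)
```

In this toy case the baseline scores AUC 0.5 because it cannot anticipate article D, which has no citations before the cutoff. That is the kind of case, newly emergent legislation, where a strict temporal split separates genuine prediction from contemporaneous correlation.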
Circularity Check
No circularity: results follow directly from regex extraction and standard graph algorithms
The paper extracts citation edges via regex patterns applied to the full corpus, validates precision on a 200-decision sample, then applies power-law fitting, Louvain community detection, and citation-feature prediction using established methods. No step defines a quantity in terms of itself, renames a fitted parameter as a prediction, or relies on a self-citation chain for a uniqueness claim. All reported quantities (degree exponent, modularity Q, AUC) are computed outputs from the observed graph rather than tautological re-expressions of inputs.
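For readers unfamiliar with how a degree exponent is obtained as a computed output, the sketch below uses the continuous maximum-likelihood estimator described by Clauset, Shalizi, and Newman: alpha = 1 + n / sum(ln(x_i / x_min)), with standard error (alpha - 1) / sqrt(n). This is an assumption about the general technique; the summary does not say which estimator, x_min, or discreteness correction the paper used. The function names and the synthetic sample are hypothetical.

```python
import math
import random


def powerlaw_alpha_mle(values, x_min):
    """Continuous MLE for the exponent of p(x) ~ x^(-alpha) on the tail x >= x_min."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if not tail or log_sum <= 0:
        raise ValueError("tail is empty or degenerate")
    alpha = 1.0 + len(tail) / log_sum
    std_err = (alpha - 1.0) / math.sqrt(len(tail))
    return alpha, std_err


def sample_powerlaw(n, alpha, x_min, seed=0):
    """Draw n synthetic values from a continuous power law by inverse-transform sampling."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Synthetic check: the estimator should recover the exponent used to generate the data.
synthetic_degrees = sample_powerlaw(20000, alpha=1.57, x_min=1.0)
alpha_hat, alpha_se = powerlaw_alpha_mle(synthetic_degrees, x_min=1.0)
```

Real degree data are integers, so a discrete estimator or a corrected x_min is usually preferable there; the continuous form is shown only because it fits in a few lines.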
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Regular expressions applied to full-text decisions can extract citation links of six types with near-perfect precision.
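What such an extraction rule might look like is sketched below for a single citation shape: an article number followed by a code abbreviation and "України", as in "ст. 625 ЦК України". The pattern, the four-code lookup table, and the function name are illustrative assumptions. The paper's actual patterns and its six link types are not given in this summary.

```python
import re

# Hypothetical lookup: common abbreviations of Ukrainian codes.
CODE_NAMES = {
    "ЦК": "Civil Code",
    "КК": "Criminal Code",
    "ГК": "Commercial Code",
    "КАС": "Code of Administrative Procedure",
}

# "ст." or an inflected form of "стаття", an article number, a code abbreviation, "України".
ARTICLE_CITATION = re.compile(
    r"(?:ст\.|статт(?:я|і|ею|ю))\s*(\d+(?:-\d+)?)\s+(ЦК|КК|ГК|КАС)\s+України"
)


def extract_article_citations(text):
    """Return (code abbreviation, article number) pairs in order of appearance."""
    return [(code, article) for article, code in ARTICLE_CITATION.findall(text)]


sample = "Відповідно до ст. 625 ЦК України та статті 185 КК України суд ухвалив рішення."
found = extract_article_citations(sample)
```

A pattern this narrow is case-sensitive and misses lists such as "ст. ст. 10, 11" and citations that spell out the code's full name. Gaps of that kind affect recall, which a precision-only validation cannot detect.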
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: Louvain community detection on the co-citation projection recovers legal domain boundaries (civil, criminal, administrative, commercial) with modularity Q = 0.44-0.55
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: The degree distribution follows a power law (alpha = 1.57 +/- 0.008)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Dataset: https://huggingface.co/datasets/overthelex/ukrainian-legal-citation-graph; source code: https://github.com/overthelex/SecondLayer.