arxiv: 2605.15362 · v1 · pith:2ZJ4HTA7 · submitted 2026-05-14 · 💻 cs.CL · cs.DL · cs.IR

Automatic Construction of a Legal Citation Graph from 100 Million Ukrainian Court Decisions: Large-Scale Extraction, Topological Analysis, and Ontology-Driven Clustering

Pith reviewed 2026-05-19 15:34 UTC · model grok-4.3

classification: 💻 cs.CL, cs.DL, cs.IR
keywords: legal citation graph, Ukrainian court decisions, community detection, legal ontology, citation network analysis, legislative prediction, power-law networks

The pith

A citation graph from 100 million Ukrainian court decisions encodes legal domain boundaries without supervision and predicts future legislative importance with near-perfect accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds the first large-scale citation graph by scanning 99.5 million full-text Ukrainian court decisions with regular expressions to extract 502 million links of six types. The resulting network exhibits a power-law degree distribution and, through community detection on its co-citation projection, partitions itself into clusters that match the main legal domains such as civil, criminal, administrative, and commercial law. Citation statistics drawn from the same graph then forecast which articles will rank among the most cited in the future with an AUC of 0.9984, while temporal shifts in the network register events such as the 2022 invasion as measurable changes in citation entropy.

Core claim

By constructing a citation graph of 502 million edges from the complete registry of 99.5 million Ukrainian court decisions, the authors show that Louvain community detection applied to the co-citation projection recovers legal domain boundaries with modularity 0.44-0.55 and high temporal stability, while citation-derived features predict the top-1000 most important articles with AUC 0.9984 and detect legislative regime changes as phase transitions in network entropy.

What carries the argument

The co-citation projection of the citation graph together with Louvain community detection, which partitions the network into clusters that align with established legal domains without using any labeled training data.
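As a rough illustration of the two ingredients named above, the sketch below builds a co-citation projection (two articles are linked when the same decision cites both) and scores a given partition with Newman-Girvan modularity Q. It does not implement Louvain itself, which searches for the partition that maximizes Q. The article identifiers and the toy partition are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations

def cocitation_projection(decision_to_articles):
    """Edge weight (a, b) = number of decisions that cite both a and b."""
    weights = defaultdict(int)
    for articles in decision_to_articles.values():
        for a, b in combinations(sorted(set(articles)), 2):
            weights[(a, b)] += 1
    return dict(weights)

def modularity(weights, community_of):
    """Modularity Q = sum over communities of L_c/m - (d_c / 2m)^2."""
    m = sum(weights.values())              # total edge weight
    internal = defaultdict(float)          # L_c: weight inside community c
    strength = defaultdict(float)          # d_c: summed weighted degree of c
    for (a, b), w in weights.items():
        strength[community_of[a]] += w
        strength[community_of[b]] += w
        if community_of[a] == community_of[b]:
            internal[community_of[a]] += w
    return sum(internal[c] / m - (strength[c] / (2 * m)) ** 2 for c in strength)

# Hypothetical toy data: decision id -> cited article ids.
decisions = {
    "d1": ["CC-16", "CC-625"],
    "d2": ["CC-16", "CC-625", "CC-1166"],
    "d3": ["KK-185", "KK-70"],
    "d4": ["KK-185", "KK-70", "KK-65"],
    "d5": ["CC-1166", "KK-185"],           # one cross-domain decision
}
weights = cocitation_projection(decisions)
partition = {a: a.split("-")[0] for arts in decisions.values() for a in arts}
q = modularity(weights, partition)
```

On real data the partition would come from a community-detection run rather than from a label prefix; the prefix is used here only to keep the example self-contained.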

If this is right

  • The citation network follows a power-law degree distribution with exponent 1.57, placing it near the EU Court of Justice in scale-free character.
  • Louvain communities on co-citations recover the four principal legal domains with modularity Q between 0.44 and 0.55 and normalized mutual information of 0.83-0.86 across time periods.
  • Citation features alone predict the top-1000 future high-impact articles with AUC 0.9984, outperforming a naive frequency baseline.
  • Temporal dynamics in the network identify legislative regime changes as phase transitions and register the 2022 invasion as a rise in citation entropy from 11.02 to 13.49.
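For the power-law exponent in the first bullet, a common estimator is the continuous maximum-likelihood formula from Clauset, Shalizi, and Newman: alpha = 1 + n / sum(ln(x_i / x_min)), with standard error (alpha - 1) / sqrt(n). The sketch below applies it to synthetic data. Whether the paper used this estimator, and which x_min it chose, is not stated in the source, so both are assumptions here.

```python
import math
import random

def fit_power_law_alpha(values, x_min):
    """Continuous MLE of the power-law exponent for the tail x >= x_min."""
    tail = [x for x in values if x >= x_min]
    n = len(tail)
    log_sum = sum(math.log(x / x_min) for x in tail)
    alpha = 1.0 + n / log_sum
    std_err = (alpha - 1.0) / math.sqrt(n)
    return alpha, std_err, n

def sample_power_law(alpha, x_min, size, seed=0):
    """Inverse-transform sampling from p(x) ~ x^(-alpha), x >= x_min."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(size)]

# Synthetic check: recover a known exponent (2.5 is an arbitrary test value).
synthetic = sample_power_law(alpha=2.5, x_min=1.0, size=20000)
alpha_hat, alpha_se, n_tail = fit_power_law_alpha(synthetic, x_min=1.0)
```

Citation counts are integers, so a discrete variant of the estimator may be more appropriate on the real degree sequence; the continuous form is shown because it is the simplest to verify.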

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The automatically derived legal ontology could be used as a dynamic layer in AI systems that retrieve and reason over case law without hand-crafted taxonomies.
  • The same extraction approach might be repeated on court archives from other countries to produce comparable jurisdiction-specific legal maps.
  • Entropy spikes or community reorganizations could serve as quantitative indicators for monitoring the emergence of new legal subfields after major societal events.
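To make the entropy indicator in the last bullet concrete, the sketch below computes the Shannon entropy of the distribution of citations over cited articles for a given period. The source does not say which logarithm base or which distribution underlies the reported values (11.02 and 13.49), so base 2 and a per-article citation share are assumptions.

```python
import math
from collections import Counter

def citation_entropy(cited_articles):
    """Shannon entropy (bits) of the share of citations each article receives."""
    counts = Counter(cited_articles)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical periods: citations concentrated on one article vs. spread evenly.
concentrated = ["art-1"] * 8
spread = ["art-1", "art-2", "art-3", "art-4"] * 2
h_concentrated = citation_entropy(concentrated)
h_spread = citation_entropy(spread)
```

Entropy rises when citations spread over more articles, which is one way a sudden influx of newly cited legislation would show up in this measure.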

Load-bearing premise

Regular-expression patterns applied to full-text decisions correctly locate and classify all six types of citation links across the entire collection, and the 200-decision validation sample is representative of the remaining 99.5 million documents.
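The premise is easier to assess with a concrete pattern in view. The sketch below shows what a regex for one link type (references to codex articles) might look like. The pattern, the list of code abbreviations, and the sample sentence are all hypothetical; the paper's actual patterns are not reproduced in this summary.

```python
import re

# Hypothetical pattern: "ст. <number> <code abbreviation> України".
# КК, ЦК, ГК are common short forms for the Criminal, Civil, and
# Commercial Codes; the list is illustrative, not the paper's.
CODEX_ARTICLE = re.compile(r"ст\.\s*(\d+(?:-\d+)?)\s+(КК|ЦК|ГК)\s+України")

def extract_codex_citations(text):
    """Return (code, article) pairs for every match in the text."""
    return [(m.group(2), m.group(1)) for m in CODEX_ARTICLE.finditer(text)]

# Sample sentence (English: "The court qualified the acts under part 1 of
# art. 185 of the Criminal Code and applied art. 625 of the Civil Code.")
sample = ("Суд кваліфікував дії за ч. 1 ст. 185 КК України "
          "та застосував ст. 625 ЦК України.")
found = extract_codex_citations(sample)
```

A pattern this strict favors precision: it will miss spelled-out forms such as "статті 185 Кримінального кодексу", which is the kind of gap a recall measurement would expose.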

What would settle it

Two findings would falsify these claims. A larger manual audit of randomly sampled decisions that finds precision substantially below the reported 1.00 would undermine the extraction claim. A domain-expert review showing that the detected communities do not correspond to recognized legal fields would undermine the ontology claim.
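How much a larger audit would tighten the evidence can be sketched with the Wilson score interval, the interval type the abstract names. The function below is the standard formula; the sample sizes are illustrative. For 200 correct out of 200 it gives a lower bound of roughly 0.981, close to the reported 0.982; the small gap may reflect rounding or a different interval variant, which the source does not specify.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Two-sided Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Perfect precision on samples of different sizes (sizes are illustrative).
low_200, high_200 = wilson_interval(200, 200)
low_2000, high_2000 = wilson_interval(2000, 2000)
```

Even with zero observed errors, the lower bound only approaches 1 as the sample grows, which is why sample size and sampling design matter for a claim about 99.5 million documents.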

Figures

Figures reproduced from arXiv: 2605.15362 by Volodymyr Ovcharov.

Figure 1: Distribution of 502 million citation edges by type. Codex articles dominate (78.9%); …
Figure 2: Citations extracted vs. decisions processed per year. The near-linear relationship con…
Figure 3: Log-log degree distribution of legislation articles by citation count. The dashed line shows …
Figure 4: Annual citation volume (2007–2025). Vertical lines mark major legislative events: 2010 …
Figure 5: Stacked area chart of citation volume by type (top 3 types). Case references (red) …
Figure 6: Top-10 most-cited legislation articles. Criminal Code art. 185 (theft) leads with 3.3M …
Figure 7: Mean vs. median citation degree by type (log scale, excluding Supreme Court singleton). …

Original abstract

Half a billion citation edges extracted from 100.7 million Ukrainian court decisions reveal that judicial citation structure encodes legal domain boundaries without supervision and predicts future legislative importance with near-perfect accuracy. We construct the first large-scale citation graph from the complete EDRSR registry (99.5 million full texts, 1.1 TB), extracting 502 million citation links across six types via regex on commodity hardware in approximately 5 hours, with precision of 1.00 on a 200-decision validation sample (95% Wilson CI: [0.982, 1.000]). Three principal findings emerge. (1) The degree distribution follows a power law (alpha = 1.57 +/- 0.008), placing the Ukrainian court network near the EU Court of Justice and below the US Supreme Court, with hub articles cited by millions of decisions. (2) Louvain community detection on the co-citation projection recovers legal domain boundaries (civil, criminal, administrative, commercial) with modularity Q = 0.44-0.55 and temporal stability (NMI = 0.83-0.86 across periods), constituting an automatically constructed legal ontology grounded in judicial practice. (3) Citation features predict top-1000 articles with AUC = 0.9984, substantially outperforming a naive frequency baseline (P@1000 = 0.655); temporal dynamics detect legislative regime changes as phase transitions and the 2022 invasion as a citation entropy spike (H: 11.02 -> 13.49) with emergent wartime legislation nodes. The citation-derived ontology is operationalized as the domain layer of a workflow memory system for LLM-assisted legal analysis, connecting to the ontology-controlled paradigm. The extraction pipeline, analysis code, and aggregated statistics are released as open data.
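The abstract reports two ranking metrics, AUC and P@1000, which answer different questions: AUC measures how often a positive item is scored above a negative one across the whole ranking, while precision at k looks only at the top k. The sketch below computes both from scratch on toy data. The labels and scores are hypothetical, and nothing here reproduces the paper's features or model.

```python
def roc_auc(labels, scores):
    """Rank-based AUC (Mann-Whitney U), with average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def precision_at_k(labels, scores, k):
    """Share of true positives among the k highest-scored items."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(labels[i] for i in top) / k

# Toy example: 1 = article is in the target set, 0 = it is not.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
p_at_2 = precision_at_k(labels, scores, 2)
```

When positives are a tiny fraction of all articles, AUC can be very high while precision at k stays noticeably lower, so the two numbers are best read together.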

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper constructs the first large-scale citation graph from the complete EDRSR registry of 99.5 million Ukrainian court decisions (1.1 TB), extracting 502 million citation links across six types via regex patterns on commodity hardware. It reports a power-law degree distribution (alpha = 1.57 +/- 0.008), uses Louvain community detection on the co-citation projection to recover legal domains (civil, criminal, administrative, commercial) with modularity Q = 0.44-0.55 and temporal stability (NMI = 0.83-0.86), and shows citation features predict the top-1000 most important articles with AUC = 0.9984 (outperforming a frequency baseline). The work releases the extraction pipeline, analysis code, and aggregated statistics as open data and operationalizes the resulting ontology as a domain layer for LLM-assisted legal analysis.
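The temporal-stability figure quoted above is a normalized mutual information (NMI) between community assignments from different periods. A minimal sketch follows, assuming the arithmetic-mean normalization 2·I(A;B) / (H(A) + H(B)); the source does not state which normalization the paper uses, and the toy assignments are hypothetical.

```python
import math
from collections import Counter

def normalized_mutual_information(labels_a, labels_b):
    """NMI = 2 * I(A;B) / (H(A) + H(B)) for two labelings of the same nodes."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum(v / n * math.log(v / n) for v in counts.values())

    mutual_info = sum(
        v / n * math.log(v * n / (count_a[a] * count_b[b]))
        for (a, b), v in joint.items()
    )
    denom = entropy(count_a) + entropy(count_b)
    return 1.0 if denom == 0 else 2 * mutual_info / denom

# Hypothetical community labels for the same four articles in two periods.
period_1 = ["civil", "civil", "criminal", "criminal"]
period_2_same = ["c2", "c2", "c1", "c1"]       # same grouping, renamed labels
period_2_mixed = ["c1", "c2", "c1", "c2"]      # grouping unrelated to period 1
nmi_same = normalized_mutual_information(period_1, period_2_same)
nmi_mixed = normalized_mutual_information(period_1, period_2_mixed)
```

NMI ignores label names and compares only the groupings, which suits Louvain output, where community ids are arbitrary from run to run.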

Significance. If the graph extraction holds at scale, the manuscript delivers a substantial contribution to computational legal studies and network science by providing the largest national judicial citation network to date. The unsupervised recovery of domain boundaries from citation structure and the near-perfect predictive performance for legislative importance demonstrate that citation topology encodes meaningful legal information. The explicit release of code, pipeline, and data is a clear strength that enables reproducibility and downstream use in ontology-controlled LLM workflows.

major comments (2)
  1. [Validation subsection (likely §4)] The reported precision of 1.00 (Wilson CI [0.982, 1.000]) on a 200-decision sample supports extraction quality for the six citation types, but the manuscript provides neither recall nor any description of how the sample was drawn (random, stratified by court type/year/region, or otherwise). Ukrainian decisions exhibit substantial variation in formatting and phrasing; without recall measurement or error analysis on edge cases, the 502 million extracted edges may systematically under- or over-count links, directly affecting the power-law fit in §5, the Louvain communities (Q and NMI) in §6, and the AUC = 0.9984 prediction results in §7.
  2. [Prediction experiment (likely §7)] The claim that citation features predict future legislative importance with AUC = 0.9984 requires explicit temporal train-test splits (e.g., training on decisions up to year T and testing on later periods). The current description does not specify such splits or how the top-1000 target articles are defined temporally, which is load-bearing for the interpretation of phase transitions and the 2022 invasion entropy spike as predictive signals rather than contemporaneous correlations.
minor comments (2)
  1. [Abstract] The abstract states extraction took 'approximately 5 hours' but should specify exact hardware configuration and parallelization details to allow independent reproduction.
  2. [Figures] Figure captions and legends for the degree-distribution plot and community visualizations should explicitly label axes, color mappings, and any resolution parameters used in Louvain.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. The comments highlight important aspects of validation and experimental design that we will address to strengthen the paper. Below we provide point-by-point responses to the major comments.

  1. Referee (major comment 1, validation subsection): the manuscript reports precision on a 200-decision sample without recall and without describing how the sample was drawn, so the 502 million extracted edges may be systematically under- or over-counted, affecting the power-law fit, the Louvain communities, and the prediction results.

    Authors: We agree that a more complete validation, including recall and sampling details, would be beneficial. We will revise the validation subsection to state that the sample was drawn by stratified random sampling over court type and decision year. We will also discuss potentially missed citations, based on manual inspection of edge cases, and analyze how any under-counting could affect the power-law fit, community metrics, and prediction performance. This will give a more balanced view of extraction quality.

    Revision: yes

  2. Referee (major comment 2, prediction experiment): the AUC = 0.9984 claim requires explicit temporal train-test splits and a temporal definition of the top-1000 target articles; without them, the phase transitions and the 2022 entropy spike cannot be read as predictive signals rather than contemporaneous correlations.

    Authors: We thank the referee for this observation on the temporal aspects of the prediction task. We will update the prediction experiment section to state the temporal train-test splits explicitly, with training on data up to year T and testing on subsequent periods. We will also clarify how the top-1000 articles are identified from their importance in the post-training time window. These details will support reading the results as predictive, including the identification of phase transitions and the entropy changes associated with the 2022 invasion.

    Revision: yes
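The temporal split discussed in the second response can be sketched as follows: citation counts up to a cutoff year T form the inputs, and the target set is the top-k articles by citations after T. The example scores a naive frequency baseline that simply predicts the past top-k. The data, the cutoff, and k are hypothetical, and the paper's actual features and model are not shown in this summary.

```python
from collections import Counter

def temporal_split(citations, cutoff_year):
    """Split (year, article_id) pairs into past (<= T) and future (> T) counts."""
    citations = list(citations)             # allow generators to be read twice
    past = Counter(a for y, a in citations if y <= cutoff_year)
    future = Counter(a for y, a in citations if y > cutoff_year)
    return past, future

def top_k(counter, k):
    """Set of the k most frequent article ids."""
    return {article for article, _ in counter.most_common(k)}

def frequency_baseline_precision(citations, cutoff_year, k):
    """Precision at k when the past top-k is used to predict the future top-k."""
    past, future = temporal_split(citations, cutoff_year)
    return len(top_k(past, k) & top_k(future, k)) / k

# Hypothetical citation log: (year, cited article id).
log = ([(2019, "a")] * 3 + [(2019, "b")] * 2 + [(2019, "c")]
       + [(2021, "a")] * 2 + [(2021, "c")] * 3 + [(2021, "b")])
baseline_p_at_2 = frequency_baseline_precision(log, cutoff_year=2020, k=2)
```

Keeping every feature strictly on the past side of the cutoff is what separates prediction from contemporaneous correlation, which is the distinction the referee asks the authors to document.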

Circularity Check

0 steps flagged

No circularity: results follow directly from regex extraction and standard graph algorithms

The paper extracts citation edges via regex patterns applied to the full corpus, validates precision on a 200-decision sample, then applies power-law fitting, Louvain community detection, and citation-feature prediction using established methods. No step defines a quantity in terms of itself, renames a fitted parameter as a prediction, or relies on a self-citation chain for a uniqueness claim. All reported quantities (degree exponent, modularity Q, AUC) are computed outputs from the observed graph rather than tautological re-expressions of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests primarily on the assumption that regex can reliably extract citations and on standard network-science methods; no new physical entities or heavily fitted parameters are introduced.

axioms (1)
  • Domain assumption: Regular expressions applied to full-text decisions can extract citation links of six types with near-perfect precision.
    The entire pipeline begins with regex extraction on the 1.1 TB corpus.



