
arxiv: 2605.19346 · v1 · pith:YVVSX7M5new · submitted 2026-05-19 · 💻 cs.CL · cs.AI · cs.LG

IMLJD: A Computational Dataset for Indian Matrimonial Litigation Analysis

Pith reviewed 2026-05-20 06:46 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords matrimonial litigation · quashing petitions · Supreme Court · Karnataka High Court · IPC 498A · legal dataset · CrPC 482 · knowledge graph

The pith

Quashing petitions in matrimonial cases succeed at 57.6 percent in the Supreme Court but only 39.7 percent in the Karnataka High Court.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IMLJD, an open dataset of 3,613 Indian court judgments on matrimonial disputes under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482. It draws 1,474 cases from the Supreme Court spanning 2000 to 2024 and 2,139 cases from the Karnataka High Court spanning 2018 to 2024, adding structured outcome labels, metadata indicators, and a knowledge graph. Analysis of quashing petitions shows success rates of 57.6 percent at the Supreme Court compared with 39.7 percent at the Karnataka High Court. The gap widens to 19.6 percentage points when the comparison is limited to the overlapping 2018-2024 window, with the Supreme Court rate reaching 59.3 percent. The full dataset, code, and graph are released to support further computational work on legal outcomes.

Core claim

The authors compile and release the IMLJD dataset containing 3,613 judgments from the Supreme Court (2000-2024) and Karnataka High Court (2018-2024) on matrimonial litigation under IPC 498A, PWDVA, and CrPC 482. Analysis of this dataset shows that quashing petitions succeed in 57.6% of cases at the Supreme Court versus 39.7% at the Karnataka High Court, with the gap widening to 19.6 points in the overlapping 2018-2024 period.
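The two differentials can be checked directly from the reported rates. The short sketch below only redoes that arithmetic; the constant and function names are illustrative, and it assumes the Karnataka High Court rate is the same in both comparisons because that sub-corpus already spans only 2018-2024.

```python
# Quash success rates in percent, as reported in the paper's abstract.
SC_ALL_YEARS = 57.6  # Supreme Court, 2000-2024
SC_MATCHED = 59.3    # Supreme Court, restricted to 2018-2024
KHC = 39.7           # Karnataka High Court, 2018-2024 (its full span)


def gap_points(rate_a: float, rate_b: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(rate_a - rate_b, 1)


overall_gap = gap_points(SC_ALL_YEARS, KHC)  # all available years
matched_gap = gap_points(SC_MATCHED, KHC)    # overlapping window only
print(overall_gap, matched_gap)
```

The unmatched gap works out to 17.9 points, so temporal alignment adds 1.7 points to the differential rather than shrinking it.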

What carries the argument

The IMLJD dataset of 3,613 structured court judgments with outcome labels for quashing petitions and a knowledge graph for tracing litigation patterns.

Load-bearing premise

The collected judgments are representative of matrimonial disputes under the specified laws and the outcome labels for quashing petition success are accurately derived from the judgment texts without significant extraction errors.

What would settle it

A manual audit of a random sample of 200 judgments that finds the automated quashing-success labels wrong in more than 15 percent of cases would make the reported 57.6 percent and 39.7 percent rates unreliable.
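A minimal sketch of how such an audit could be scored, assuming the automated labels and the auditor's labels are both available as dictionaries keyed by a case identifier. The function names, the identifier scheme, and the seed are hypothetical and are not taken from the released code.

```python
import random


def draw_audit_sample(case_ids, k=200, seed=0):
    """Uniform random sample without replacement, reproducible via the seed."""
    rng = random.Random(seed)
    return rng.sample(list(case_ids), k)


def audit_error_rate(auto_labels, manual_labels):
    """Share of audited cases whose automated label differs from the manual one."""
    if not manual_labels:
        raise ValueError("no audited cases")
    wrong = sum(
        1 for case_id, label in manual_labels.items()
        if auto_labels[case_id] != label
    )
    return wrong / len(manual_labels)


def fails_audit(error_rate, threshold=0.15):
    """True when the error rate is strictly above the 15 percent threshold."""
    return error_rate > threshold


# Hypothetical identifiers 0..3612 stand in for the 3,613 judgments.
sample = draw_audit_sample(range(3613), k=200, seed=0)
print(len(sample))
```

With 200 audited cases, the threshold is crossed at 31 disagreements (15.5 percent); exactly 30 disagreements sits on the boundary and does not count as a failure under the strict comparison used here.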

Figures

Figures reproduced from arXiv: 2605.19346 by Joy Bose.

Figure 1: End-to-end IMLJD data collection and normalization pipeline. Supreme Court and Karnataka High Court judgments are independently filtered from AWS Open Data archives, normalized into a unified schema, enriched with metadata-derived indicators, and released as parquet/CSV datasets, a knowledge graph, and a BM25 retrieval interface.
Figure 2: Distribution of quash petition outcomes across Supreme Court and Karnataka High Court sub-corpora. Supreme Court petitions show a higher quash success rate (57.6%) than Karnataka High Court petitions (39.7%), with the differential increasing to 19.6 percentage points under matched 2018–2024 temporal comparison.
Original abstract

We present IMLJD, an open dataset of 3,613 Indian court judgments covering matrimonial disputes under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482. The dataset covers the Supreme Court of India from 2000 to 2024 (1,474 cases) and the Karnataka High Court from 2018 to 2024 (2,139 cases), with structured outcome labels, metadata-derived indicators, and a knowledge graph. We find that 57.6% of quashing petitions succeed at the Supreme Court level compared to 39.7% at the Karnataka High Court level. On a matched 2018 to 2024 period, the SC quash rate is 59.3%, widening the differential to 19.6 percentage points and confirming the finding is robust to temporal adjustment. The dataset, code, and knowledge graph are released openly at https://github.com/joyboseroy/imljd and https://huggingface.co/datasets/joyboseroy/imljd.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces IMLJD, an open dataset of 3,613 Indian court judgments on matrimonial disputes under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482. It covers Supreme Court cases from 2000 to 2024 and Karnataka High Court cases from 2018 to 2024, providing structured outcome labels, metadata, and a knowledge graph. The authors report quashing petition success rates of 57.6% at the Supreme Court compared to 39.7% at the Karnataka High Court; restricting the comparison to the matched 2018-2024 window raises the Supreme Court rate to 59.3% and widens the differential to 19.6 percentage points.

Significance. This dataset release, if the outcome labels are reliable, offers a valuable resource for computational analysis of Indian legal texts, particularly in the domain of matrimonial litigation. The empirical findings on differential quash rates could inform discussions on judicial consistency across court hierarchies. The open availability of the dataset, code, and knowledge graph at the provided repositories is a positive aspect that supports reproducibility and further research.

major comments (2)
  1. [Abstract and Dataset Construction] The central empirical claims depend on accurate extraction of quashing petition outcomes (success/failure) from the judgment texts for all 3,613 cases. The manuscript does not report any details on the labeling methodology, inter-annotator agreement, manual validation, or error rates on a sample. Without this, systematic biases in labeling (such as misclassifying conditional orders or withdrawn petitions) could affect the reported rates of 57.6% versus 39.7% and the 19.6-point differential.
  2. [Results and Temporal Robustness] While the temporal matching for 2018-2024 is presented to confirm robustness, the paper should provide more specifics on how the matching was performed and whether any adjustments were made for case characteristics beyond the year to ensure the differential is not confounded by other factors.
minor comments (2)
  1. [Introduction] The manuscript could benefit from additional references to prior work on legal NLP datasets or Indian judicial data analysis to better contextualize the contribution.
  2. [Results] Ensure that any tables presenting the statistics include confidence intervals or standard errors for the percentage rates to allow readers to assess the precision of the estimates.
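The confidence intervals requested in the last minor comment could be computed along the following lines. This is a sketch using the Wilson score interval, chosen here for illustration and not named in the paper. The counts in the usage lines are hypothetical: the review does not state how many of the judgments are quashing petitions, so 288 out of 500 is used only because it reproduces a 57.6% rate.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical denominator: 288 successes out of 500 petitions (57.6%).
low, high = wilson_interval(288, 500)
print(f"{low:.3f} to {high:.3f}")
```

The interval narrows as the number of petitions grows, which is why the actual per-court denominators matter for judging whether the 19.6-point differential is precise.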

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment point by point below, clarifying our approach and outlining the revisions we will make to improve transparency and robustness.

Point-by-point responses
  1. Referee: [Abstract and Dataset Construction] The central empirical claims depend on accurate extraction of quashing petition outcomes (success/failure) from the judgment texts for all 3,613 cases. The manuscript does not report any details on the labeling methodology, inter-annotator agreement, manual validation, or error rates on a sample. Without this, systematic biases in labeling (such as misclassifying conditional orders or withdrawn petitions) could affect the reported rates of 57.6% versus 39.7% and the 19.6-point differential.

    Authors: We agree that the manuscript would benefit from greater transparency on how the structured outcome labels were produced. The labels were generated via a hybrid process: automated extraction of key phrases from the final order/disposition sections of each judgment (e.g., 'petition allowed and quashed', 'petition dismissed', 'quashing petition rejected'), followed by manual review of ambiguous or conditional orders by the first author, who has legal training. A random sample of 150 judgments was double-annotated by a second reviewer to assess consistency. We acknowledge that these steps were not described in sufficient detail. In the revised manuscript we will add a new subsection under Dataset Construction that specifies the exact annotation guidelines, provides examples of edge cases (including conditional quashing orders and withdrawn petitions), reports the observed agreement rate, and discusses potential sources of labeling error. This addition will allow readers to evaluate the reliability of the reported success rates. revision: yes

  2. Referee: [Results and Temporal Robustness] While the temporal matching for 2018-2024 is presented to confirm robustness, the paper should provide more specifics on how the matching was performed and whether any adjustments were made for case characteristics beyond the year to ensure the differential is not confounded by other factors.

    Authors: The temporal matching consisted of simply restricting both the Supreme Court and Karnataka High Court subsets to the common 2018–2024 window and recomputing the quash rates on this overlapping period; no propensity-score or other covariate matching was applied. We will expand the Results section to state the exact case counts before and after the temporal filter, describe the filtering criterion explicitly, and note that no further adjustments for case characteristics (such as petitioner gender, specific IPC sections, or case complexity) were performed. We will also add a brief discussion acknowledging that residual confounding by unmeasured case features remains possible, while emphasizing that the persistence of the differential after temporal alignment still supports the core observation. If the referee believes additional matching would strengthen the claim, we are open to exploring it with the available metadata in a future extension. revision: partial
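The two procedures described in the responses above, phrase-based labeling of the disposition text and recomputation of rates on a restricted year window, can be sketched as follows. This is an illustration under stated assumptions and not the authors' code: the regular expressions are built from the three example phrases quoted in the rebuttal, while the label names, record fields, `needs_review` routing rule, and toy records are hypothetical.

```python
import re

# Patterns derived from the example phrases in the rebuttal; the real
# guidelines are not published in this review, so these are assumptions.
QUASHED = re.compile(r"\bpetition\s+(?:is\s+)?allowed\b", re.IGNORECASE)
NOT_QUASHED = re.compile(
    r"\bpetition\s+(?:is\s+)?(?:dismissed|rejected)\b", re.IGNORECASE
)
WITHDRAWN = re.compile(r"\bwithdrawn\b", re.IGNORECASE)


def label_disposition(text):
    """Label a disposition; ambiguous or withdrawn orders go to manual review."""
    allowed = bool(QUASHED.search(text))
    refused = bool(NOT_QUASHED.search(text))
    if WITHDRAWN.search(text) or allowed == refused:
        return "needs_review"
    return "quashed" if allowed else "not_quashed"


def quash_rate(records, court, first_year=None, last_year=None):
    """Quash rate for one court over decided cases, optionally year-limited."""
    decided = [
        r for r in records
        if r["court"] == court
        and r["label"] in ("quashed", "not_quashed")
        and (first_year is None or r["year"] >= first_year)
        and (last_year is None or r["year"] <= last_year)
    ]
    if not decided:
        raise ValueError("no decided cases in the selected window")
    return sum(r["label"] == "quashed" for r in decided) / len(decided)


# Toy records for illustration only; they are not drawn from IMLJD.
raw = [
    ("SC", 2005, "Petition allowed and quashed."),
    ("SC", 2010, "Petition dismissed."),
    ("SC", 2019, "The petition is allowed and the proceedings are quashed."),
    ("SC", 2021, "Petition allowed and quashed."),
    ("KHC", 2019, "Quashing petition rejected."),
    ("KHC", 2021, "Petition allowed and quashed."),
    ("KHC", 2022, "Petition dismissed as withdrawn."),
]
records = [
    {"court": c, "year": y, "label": label_disposition(t)} for c, y, t in raw
]
print(quash_rate(records, "SC"), quash_rate(records, "SC", 2018, 2024))
```

Cases routed to `needs_review` are excluded from the denominator in this sketch. Whether withdrawn or conditional orders count as failures, successes, or neither is exactly the kind of choice the referee asks the authors to document, since it shifts the reported rates.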

Circularity Check

0 steps flagged

Dataset release with descriptive statistics exhibits no circularity

Full rationale

The paper is a dataset release (IMLJD) accompanied by descriptive statistics on quashing petition outcomes extracted from 3,613 judgments. No derivation chain, model, prediction, or first-principles result is claimed. The reported rates (57.6% SC vs 39.7% Karnataka HC; 59.3% on matched period) are direct aggregates from the provided structured outcome labels and metadata. There are no equations, fitted parameters, self-citations used as load-bearing premises, or ansatzes that reduce the outputs to the inputs by construction. The analysis is self-contained as empirical description of the released corpus.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on standard assumptions about public court data accessibility and accurate text-based labeling; no free parameters, invented entities, or non-standard axioms are apparent from the abstract.

axioms (1)
  • domain assumption Court judgments are publicly available and can be accurately scraped and labeled for outcomes such as quashing petition success.
    The dataset construction and reported statistics depend on this premise for validity.

pith-pipeline@v0.9.0 · 5713 in / 1404 out tokens · 50682 ms · 2026-05-20T06:46:18.673133+00:00 · methodology



Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

    Arnesh Kumar v State of Bihar, (2014) 8 SCC 273

  2. [2]

    Rajesh Sharma v State of UP, (2017) 15 SCC 133

  3. [3]

    ILDC for NLP: Indian Legal Documents Corpus for Court Judgment Prediction

    Malik et al. ILDC for NLP: Indian Legal Documents Corpus for Court Judgment Prediction. ACL 2021

  4. [4]

    LawSum: A Weakly Supervised Approach for Indian Legal Document Summarization

    Shukla et al. LawSum: A Weakly Supervised Approach for Indian Legal Document Summarization. 2022

  5. [5]

    Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI

    Bose, J. FalkorDB-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI. arXiv:2605.14665, 2026

  6. [6]

    Indian High Court Judgments

    AWS Open Data Registry. Indian High Court Judgments. https://registry.opendata.aws/indian-high-court-judgments/

  7. [7]

    Indian Supreme Court Judgments

    AWS Open Data Registry. Indian Supreme Court Judgments. https://registry.opendata.aws/indian-supreme-court-judgments/

Data and Code Availability

Dataset: https://huggingface.co/datasets/joyboseroy/imljd
Code: https://github.com/joyboseroy/imljd