arxiv: 2605.14665 · v2 · pith:56M3B4GSnew · submitted 2026-05-14 · 💻 cs.AI · cs.CL · cs.IR

Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI

Pith reviewed 2026-05-19 16:24 UTC · model grok-4.3

classification 💻 cs.AI cs.CL cs.IR
keywords legal AI, IRAC framework, knowledge graph, Indian judiciary, verified generation, hallucination prevention, court judgments, precedent tracing

The pith

Legal answers from AI are accepted only when a valid path through an IRAC graph of Indian judgments supports every cited precedent and statute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Falkor-IRAC as a method that turns court judgments into nodes in a knowledge graph organized by the IRAC elements of issue, rule, analysis, and conclusion. At inference time an LLM output passes only if a Verifier Agent can trace a supporting path across precedent links, procedural transitions, and statutory references stored in the graph. This setup is meant to block the generation of invented citations and unsupported reasoning chains that appear in vector-based legal tools. The system also surfaces doctrinal conflicts between judgments as an explicit result rather than resolving them silently. Tests on a set of 51 Supreme Court judgments show the verifier approves accurate citations and rejects fabricated ones.

Core claim

The central claim is that generation of legal reasoning can be constrained by requiring every accepted output to correspond to a traceable path through an IRAC-structured graph of judgments. The graph encodes precedent relationships, procedural state changes, and statutory references from Supreme Court and High Court cases. A separate Verifier Agent performs the path check as a falsifiability test, and the framework reports doctrinal conflicts when paths lead to inconsistent rules. On the 51-judgment proof-of-concept corpus the verifier validated correct citations and rejected fabricated ones. The paper assesses this behaviour with graph-native measures such as path validity rate and hallucinated precedent rate.

What carries the argument

The IRAC knowledge graph, which stores judgments as nodes linked by precedent, procedural transitions, and statutory references, together with the Verifier Agent that accepts or rejects generated answers according to the existence of a supporting path.
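
As a rough illustration of the path check described above, the following Python sketch treats the precedent layer as a plain adjacency map and accepts a set of citations only if each cited case is reachable from an anchor case. It is a minimal sketch under stated assumptions, not the paper's implementation: the case identifiers, the `CITES` map, and the function names `path_exists` and `verify` are all hypothetical, and the actual system stores its graph in FalkorDB rather than in a Python dictionary.

```python
from collections import deque

# Toy precedent layer: each case maps to the set of cases it cites.
# All case identifiers are hypothetical placeholders.
CITES = {
    "case_A": {"case_B", "case_C"},
    "case_B": {"case_D"},
    "case_C": set(),
    "case_D": set(),
}


def path_exists(graph, start, target):
    """Breadth-first search: is `target` reachable from `start`?"""
    if start not in graph or target not in graph:
        return False  # a case absent from the graph can never be supported
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False


def verify(graph, anchor_case, cited_cases):
    """Accept only if every cited case is reachable from the anchor case."""
    unsupported = [c for c in cited_cases
                   if not path_exists(graph, anchor_case, c)]
    return {"accepted": not unsupported, "unsupported": unsupported}


print(verify(CITES, "case_A", ["case_D"]))            # supported via case_B
print(verify(CITES, "case_A", ["case_B", "case_Z"]))  # case_Z is not in the graph
```

Under these assumptions, a citation to a case that is absent from the graph is rejected because no path can reach it, which is the behaviour the paper attributes to the Verifier Agent for fabricated precedents.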

If this is right

  • Any LLM-generated legal answer must correspond to a traceable path in the judgment graph or be rejected.
  • Doctrinal conflicts between different court decisions are reported as a direct output.
  • Evaluation relies on path validity and citation grounding rates instead of text similarity metrics.
  • The released InIRAC dataset of over 500 annotated judgments supports further testing of graph-constrained methods.
  • The approach separates legal reasoning from pure vector retrieval by enforcing explicit structural constraints.
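
To make the graph-native evaluation idea in the list above more concrete, here is a small Python sketch that computes three of the named rates from per-query records. The paper's exact metric definitions are not reproduced on this page, so the formulas below are assumptions (simple ratios), and the `QueryResult` type, its fields, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    cited: list        # case ids cited in the generated answer
    path_valid: bool   # whether the verifier found a supporting path


def graph_native_metrics(results, known_cases):
    """Assumed definitions: plain ratios over queries and over citations."""
    all_cited = [c for r in results for c in r.cited]
    hallucinated = [c for c in all_cited if c not in known_cases]
    n_queries, n_cited = len(results), len(all_cited)
    return {
        # share of queries whose answer had a valid supporting path
        "path_validity_rate":
            sum(r.path_valid for r in results) / n_queries if n_queries else 0.0,
        # share of citations pointing at cases absent from the graph
        "hallucinated_precedent_rate":
            len(hallucinated) / n_cited if n_cited else 0.0,
        # share of citations that do resolve to a case in the graph
        "citation_grounding_accuracy":
            (n_cited - len(hallucinated)) / n_cited if n_cited else 0.0,
    }


# Illustrative records only; these are not results from the paper.
sample = [
    QueryResult(cited=["a", "b"], path_valid=True),
    QueryResult(cited=["a", "x"], path_valid=False),
]
print(graph_native_metrics(sample, known_cases={"a", "b"}))
```

Unlike BLEU or ROUGE, none of these quantities depends on the wording of the answer; they depend only on whether its citations and paths resolve against the graph.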

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the graph covers only a fraction of existing judgments, the verifier could reject otherwise sound reasoning that draws on omitted cases.
  • The same path-tracing requirement could be adapted to common-law systems outside India by constructing comparable IRAC graphs.
  • Embedding the verifier inside public-facing legal assistance tools would reduce the chance that users receive answers with invented citations.
  • Direct comparisons against vector-only RAG on larger query sets would clarify whether the graph constraint improves grounding in practice.

Load-bearing premise

A valid path through the IRAC graph is enough to guarantee that the generated reasoning is accurate and faithful to the law rather than merely consistent with the graph's encoding of the chosen judgments.

What would settle it

A trial on additional judgments or live legal queries in which the Verifier Agent accepts an answer that cites a fabricated precedent absent from the graph or that contradicts the actual holdings in the stored cases.

Figures

Figures reproduced from arXiv: 2605.14665 by Joy Bose.

Figure 1: Why vector RAG fails in legal reasoning. The standard pipeline (left) retrieves by …
Figure 2: Procedural State Machine for Indian Bail Matters. This diagram illustrates the "litigation …
Figure 3: IRAC knowledge graph node types and relationship types. Left column shows nine …
Figure 4: Three-layer knowledge graph schema. Top layer (IRAC): the central Case node connects to LegalIssue, Rule, Statute/Section, and Outcome. Middle layer (procedural): ProceduralEvent nodes chained via PRECEDES and TRIGGERS. Bottom layer (precedent): Case nodes linked via CITES, OVERRULES, DISTINGUISHES, and the novel CONFLICTS_WITH and RESOLVED_BY edges that surface unresolved doctrinal splits.
Figure 5: The Falkor-IRAC graph-constrained generation architecture. The Retrieval Agent traverses the IRAC Knowledge Graph in FalkorDB and returns path-guided context. The LLM Generator synthesises an answer. The Verifier Agent (falsifiability oracle) checks whether a valid citation path exists in the graph. On failure the system revises or abstains; on pass the verified answer with its citation chain is returned. …
Figure 7: Conflict detection as a first-class output. Case A (2014) and Case B (2016), both coordinate bench decisions, are connected by a CONFLICTS_WITH edge on the same legal proposition. Because no RESOLVED_BY edge exists, the Verifier returns CONFLICT status and surfaces both paths with conflict metadata rather than silently choosing one side. For a legal practitioner, explicit conflict disclosure is more useful…
Figure 8: GraphRAG versus graph-constrained generation. In GraphRAG (left), the graph improves retrieval but the LLM generates freely and the answer is not guaranteed to be verifiable. In Falkor-IRAC (right), the LLM generates with path guidance and the answer is accepted only if a valid supporting path exists in the graph. The distinction is between a better library and an external examiner with veto power.
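
The conflict behaviour described in the Figure 7 caption (a CONFLICTS_WITH edge with no RESOLVED_BY edge yields a CONFLICT status) can be sketched in a few lines of Python. This is only an illustration under assumptions: edges are modelled as plain triples, a RESOLVED_BY edge is assumed to run from either conflicting case to the resolving case, and the case identifiers, the `conflict_status` function, and the `"OK"` status are hypothetical. Only the edge names and the `CONFLICT` status come from the caption.

```python
# Edges as (source, relation, target) triples; identifiers are hypothetical.
EDGES = [
    ("case_A_2014", "CONFLICTS_WITH", "case_B_2016"),  # unresolved split
    ("case_C", "CONFLICTS_WITH", "case_D"),
    ("case_C", "RESOLVED_BY", "case_E"),               # assumed resolution edge
]


def conflict_status(edges, case_id):
    """Report unresolved conflicts touching `case_id` instead of picking a side."""
    unresolved = []
    for src, rel, dst in edges:
        if rel != "CONFLICTS_WITH" or case_id not in (src, dst):
            continue
        other = dst if src == case_id else src
        pair = {case_id, other}
        # Assumption: a RESOLVED_BY edge leaving either case settles the split.
        resolved = any(r == "RESOLVED_BY" and s in pair for s, r, _ in edges)
        if not resolved:
            unresolved.append(other)
    return {"status": "CONFLICT" if unresolved else "OK",
            "conflicts_with": unresolved}


print(conflict_status(EDGES, "case_B_2016"))  # unresolved, so CONFLICT
print(conflict_status(EDGES, "case_D"))       # resolved by case_E, so OK
```

Returning the conflicting counterpart alongside the status mirrors the idea of surfacing both sides with conflict metadata rather than choosing one silently.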
Abstract

Legal reasoning is not semantic similarity search. A court judgment encodes constrained symbolic reasoning: precedent propagation, procedural state transitions, and statute-bound inference. These are properties that vector-based retrieval-augmented generation (RAG) cannot faithfully represent. Hallucinated precedents, outdated statute citations, and unsupported reasoning chains remain persistent failure modes in LLM-based legal AI, with real consequences for access to justice in high-caseload jurisdictions such as India. This paper presents Falkor-IRAC, a graph-constrained generation framework for Indian legal AI that grounds generation in structured reasoning over an IRAC (Issue, Rule, Analysis, Conclusion) knowledge graph. Judgments from the Supreme Court and High Courts of India are ingested as IRAC node structures enriched with procedural state transitions, precedent relationships, and statutory references, stored in FalkorDB for low-latency agentic traversal. At inference time, LLM-generated answers are accepted only if a valid supporting path can be traced through the graph, a check performed by a falsifiability oracle called the Verifier Agent. The system also detects doctrinal conflicts as a first-class output rather than silently resolving them. Falkor-IRAC is evaluated using graph-native metrics: citation grounding accuracy, path validity rate, hallucinated precedent rate, and conflict detection rate. These metrics are argued to be more appropriate for legal reasoning evaluation than BLEU and ROUGE. On a proof-of-concept corpus of 51 Supreme Court judgments, the Verifier Agent correctly validated citations on completed queries and correctly rejected fabricated citations. Evaluation against vector-only RAG baselines is left for future work. The companion InIRAC dataset, 500+ structured Indian court judgments with IRAC annotations, is released alongside this paper.
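
The accept, revise, or abstain behaviour that the abstract and the Figure 5 caption describe can be pictured as a small control loop around a generator and a verifier. The Python sketch below is an assumption-laden illustration, not the paper's code: `constrained_generate`, the stub generator and verifier, the `max_attempts` limit, and the status strings are all hypothetical, and a real generator would be an LLM call rather than a list filter.

```python
KNOWN_CASES = {"case_A", "case_B"}  # hypothetical graph contents


def stub_generate(query, feedback):
    """Stand-in for an LLM: the first draft cites a fabricated case,
    and a revision drops whatever the verifier flagged."""
    draft = ["case_A", "case_X"]
    return [c for c in draft if not feedback or c not in feedback]


def stub_verify(answer):
    """Stand-in for the Verifier Agent: flag citations missing from the graph."""
    unsupported = [c for c in answer if c not in KNOWN_CASES]
    return {"accepted": not unsupported, "unsupported": unsupported}


def constrained_generate(query, generate, verify, max_attempts=3):
    """Return an answer only once the verifier accepts it; otherwise abstain."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        answer = generate(query, feedback)
        verdict = verify(answer)
        if verdict["accepted"]:
            return {"status": "VERIFIED", "answer": answer, "attempts": attempt}
        feedback = verdict["unsupported"]  # passed to the generator for revision
    return {"status": "ABSTAIN", "answer": None, "attempts": max_attempts}


print(constrained_generate("bail query", stub_generate, stub_verify))
print(constrained_generate("bail query", stub_generate, stub_verify, max_attempts=1))
```

The design point is that the verifier, not the generator, decides what is returned: an answer that never passes the check is withheld rather than emitted with a caveat.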

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Falkor-IRAC, a graph-constrained generation framework for Indian legal AI. It constructs an IRAC (Issue-Rule-Analysis-Conclusion) knowledge graph from Supreme Court and High Court judgments, enriched with procedural transitions, precedent links, and statutory references, stored in FalkorDB. At inference, an LLM-generated answer is accepted only if a valid supporting path exists in the graph, as checked by a Verifier Agent that also surfaces doctrinal conflicts. The system is positioned as superior to vector RAG for avoiding hallucinated precedents. Evaluation is reported on a proof-of-concept corpus of 51 Supreme Court judgments, where the Verifier Agent is said to have correctly validated citations and rejected fabrications. Comparison against vector-only RAG baselines is left for future work, and the companion InIRAC dataset (500+ annotated judgments) is released alongside the paper.

Significance. If the central claims are substantiated, the work offers a concrete engineering approach to grounding legal generation in symbolic graph traversal rather than semantic similarity, which could reduce certain classes of hallucination in high-stakes domains. The explicit release of the InIRAC dataset with IRAC annotations is a clear positive contribution that enables future reproducible research. The emphasis on graph-native metrics (citation grounding accuracy, path validity rate, hallucinated precedent rate) over BLEU/ROUGE is conceptually appropriate for the domain.

major comments (2)
  1. [Abstract / Evaluation] Abstract and Evaluation section: the statement that 'the Verifier Agent correctly validated citations on completed queries and correctly rejected fabricated citations' on the 51-judgment corpus supplies no quantitative metrics, error rates, confusion matrix, or statistical analysis. This is load-bearing for the central claim of verified reasoning; without these numbers the empirical support remains anecdotal.
  2. [Verifier Agent description] Verifier Agent and IRAC graph construction: the manuscript provides no mechanism to detect cases in which a valid path exists yet the generated reasoning still misapplies precedent, omits statutory constraints outside the selected 51 judgments, or encodes an incorrect doctrinal inference. Because the graph is derived solely from the proof-of-concept corpus, path existence is necessary but not shown to be sufficient for legal soundness or non-hallucination.
minor comments (2)
  1. [Evaluation] The paper states that comparison to vector-only RAG baselines is left for future work; this should be explicitly flagged as a limitation in the current evaluation section rather than deferred without further detail.
  2. [Methods] Notation for IRAC node types and edge semantics could be formalized earlier (e.g., a small table or diagram legend) to aid readers unfamiliar with the specific graph schema.
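
The quantitative reporting requested in major comment 1 could take a form like the following Python sketch, which tallies a confusion matrix for the verifier's accept/reject decisions against ground-truth labels. Everything here is illustrative: the function name `verifier_confusion`, the record format, and the sample records are hypothetical, and the numbers are not results from the paper.

```python
def verifier_confusion(records):
    """records: iterable of (citation_is_genuine, verifier_accepted) booleans."""
    counts = {"true_accept": 0, "false_accept": 0,
              "true_reject": 0, "false_reject": 0}
    for genuine, accepted in records:
        if accepted:
            counts["true_accept" if genuine else "false_accept"] += 1
        else:
            counts["false_reject" if genuine else "true_reject"] += 1
    fabricated = counts["false_accept"] + counts["true_reject"]
    genuine_total = counts["true_accept"] + counts["false_reject"]
    # Share of fabricated citations that slipped through the verifier.
    counts["false_accept_rate"] = (
        counts["false_accept"] / fabricated if fabricated else 0.0)
    # Share of genuine citations that the verifier wrongly rejected.
    counts["false_reject_rate"] = (
        counts["false_reject"] / genuine_total if genuine_total else 0.0)
    return counts


# Made-up labels for illustration only.
records = [(True, True), (True, True), (True, False), (False, False), (False, True)]
print(verifier_confusion(records))
```

Reporting both error rates matters for the referee's point: a verifier can reject every fabricated citation and still be of limited use if it also rejects many genuine ones.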

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and substantive comments on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the empirical presentation and clarify limitations of the proof-of-concept evaluation.

Point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: the statement that 'the Verifier Agent correctly validated citations on completed queries and correctly rejected fabricated citations' on the 51-judgment corpus supplies no quantitative metrics, error rates, confusion matrix, or statistical analysis. This is load-bearing for the central claim of verified reasoning; without these numbers the empirical support remains anecdotal.

    Authors: We agree that the current description of results on the 51-judgment corpus is primarily qualitative and lacks the requested quantitative support. As this constitutes a proof-of-concept evaluation rather than a comprehensive benchmark, we reported illustrative outcomes from manual inspection of paths. In the revised manuscript we will add a results table reporting citation grounding accuracy, path validity rate, hallucinated precedent rate, and conflict detection rate on the corpus, together with a confusion matrix for the Verifier Agent's accept/reject decisions and a brief error analysis. These additions will be placed in the Evaluation section and referenced from the abstract.
    Revision: yes

  2. Referee: [Verifier Agent description] Verifier Agent and IRAC graph construction: the manuscript provides no mechanism to detect cases in which a valid path exists yet the generated reasoning still misapplies precedent, omits statutory constraints outside the selected 51 judgments, or encodes an incorrect doctrinal inference. Because the graph is derived solely from the proof-of-concept corpus, path existence is necessary but not shown to be sufficient for legal soundness or non-hallucination.

    Authors: The referee correctly notes an inherent limitation of the current graph construction: because the IRAC graph is built exclusively from the 51-judgment proof-of-concept corpus, the existence of a supporting path provides structural grounding but cannot by itself rule out misapplication of precedent, omission of external statutory constraints, or incorrect doctrinal inferences. The Verifier Agent currently enforces path validity and surfaces explicit doctrinal conflicts; it does not perform deeper semantic entailment checking. We will revise the manuscript to state explicitly that path existence is a necessary but not sufficient condition for legal soundness, to discuss this boundary in the Limitations section, and to outline future extensions that combine graph traversal with additional semantic or hybrid verification layers on the larger InIRAC dataset.
    Revision: partial

Circularity Check

0 steps flagged

No circularity: framework correctness checked against explicitly constructed external graph

The paper describes an engineering system that ingests 51 judgments into an IRAC graph and uses a Verifier Agent to accept outputs only when a supporting path exists in that graph. No equations, fitted parameters, or derivations are presented that reduce to their own inputs by construction. The graph is openly derived from the selected corpus, and the evaluation metric (path validity) directly measures consistency with that encoding rather than claiming an independent proof of legal soundness. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core claims. The system is therefore self-contained against its stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that IRAC decomposition plus procedural and precedent links can faithfully represent legal reasoning for verification purposes. No free parameters or invented physical entities are introduced; the main new elements are the Verifier Agent and the particular graph schema.

axioms (1)
  • Domain assumption: IRAC structure plus procedural state transitions and statutory references are sufficient to encode the constrained symbolic reasoning present in Indian court judgments.
    Invoked when judgments are ingested as IRAC node structures for graph traversal and verification.
invented entities (1)
  • Verifier Agent (no independent evidence)
    Purpose: performs the falsifiability check by tracing whether a supporting path exists in the IRAC graph before accepting an LLM-generated answer.
    Introduced as the core mechanism for enforcing graph constraints; no independent evidence outside the system is provided.
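
The schema elements that this ledger and the Figure 4 caption refer to can be written down compactly, which may help readers see what the domain assumption commits to. The Python sketch below lists only the node labels and edge types named in that caption (the Figure 3 caption mentions nine node types, most of which are not listed on this page). The endpoint constraints in `ALLOWED_ENDPOINTS` follow the caption's wording for the procedural and precedent layers; the enum names and the `edge_is_well_typed` helper are hypothetical.

```python
from enum import Enum


class NodeLabel(Enum):
    CASE = "Case"
    LEGAL_ISSUE = "LegalIssue"
    RULE = "Rule"
    STATUTE_SECTION = "Statute/Section"
    OUTCOME = "Outcome"
    PROCEDURAL_EVENT = "ProceduralEvent"


class EdgeType(Enum):
    PRECEDES = "PRECEDES"
    TRIGGERS = "TRIGGERS"
    CITES = "CITES"
    OVERRULES = "OVERRULES"
    DISTINGUISHES = "DISTINGUISHES"
    CONFLICTS_WITH = "CONFLICTS_WITH"
    RESOLVED_BY = "RESOLVED_BY"


_EVENT = (NodeLabel.PROCEDURAL_EVENT, NodeLabel.PROCEDURAL_EVENT)
_CASE = (NodeLabel.CASE, NodeLabel.CASE)

# Procedural edges chain ProceduralEvent nodes; precedent edges link Case nodes.
ALLOWED_ENDPOINTS = {
    EdgeType.PRECEDES: _EVENT,
    EdgeType.TRIGGERS: _EVENT,
    EdgeType.CITES: _CASE,
    EdgeType.OVERRULES: _CASE,
    EdgeType.DISTINGUISHES: _CASE,
    EdgeType.CONFLICTS_WITH: _CASE,
    EdgeType.RESOLVED_BY: _CASE,
}


def edge_is_well_typed(edge_type, src_label, dst_label):
    """Check that an edge connects the node labels its layer permits."""
    return ALLOWED_ENDPOINTS[edge_type] == (src_label, dst_label)


print(edge_is_well_typed(EdgeType.CITES, NodeLabel.CASE, NodeLabel.CASE))
print(edge_is_well_typed(EdgeType.PRECEDES, NodeLabel.CASE, NodeLabel.RULE))
```

The edges that connect a Case node to its LegalIssue, Rule, Statute/Section, and Outcome nodes are omitted because their names do not appear in the material reproduced here.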

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. IMLJD: A Computational Dataset for Indian Matrimonial Litigation Analysis

    cs.CL 2026-05 accept novelty 6.0

    IMLJD is a new open dataset of 3,613 Indian matrimonial litigation judgments from the Supreme Court (2000-2024) and Karnataka High Court (2018-2024) that reports an 18-19.6 percentage point higher success rate for quas...
