arxiv: 2601.12419 · v2 · submitted 2026-01-18 · 💻 cs.CL

Legal Experts Disagree With Rationale Extraction Techniques for Explaining ECtHR Case Outcome Classification

Pith reviewed 2026-05-16 13:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords ECtHR · legal outcome prediction · rationale extraction · interpretability · faithfulness · plausibility · LLM judge · human rights cases

The pith

Rationale extraction techniques for ECtHR outcome prediction yield explanations that legal experts find substantially different from their own reasoning, even with strong faithfulness scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how well current methods explain large language models that predict whether the European Court of Human Rights will find a violation in a case. It introduces a new dataset of carefully selected positive and negative cases and applies a framework to compare rationale extraction techniques. These techniques produce short text excerpts from the input to justify the model's output. Faithfulness is measured with normalized sufficiency (whether the excerpts alone preserve the prediction) and comprehensiveness (whether removing the excerpts changes it), while plausibility comes from legal experts judging whether the excerpts match what they would consider relevant. The results indicate a clear gap between model-derived reasons and expert reasoning, even though the faithfulness scores are strong.

Core claim

On the new ECtHR dataset, rationale extraction techniques justify model predictions of violations with concise input fragments that achieve high normalized sufficiency and comprehensiveness scores, indicating faithfulness to the model. However, legal experts assess these fragments as differing substantially from the reasons they use to determine violations or non-violations in the same cases.
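
To make the two faithfulness metrics concrete, here is a minimal Python sketch. It assumes the commonly used definitions (sufficiency: how much model confidence survives when only the rationale is shown; comprehensiveness: how much confidence is lost when the rationale is removed) and one null-baseline normalization found in the interpretability literature. The paper's exact formulas are not given on this page, so the normalization, the function names, and the `toy_predict` model are illustrative assumptions rather than the authors' implementation.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical model interface: tokens -> P(predicted label | tokens).
Predict = Callable[[Sequence[str]], float]


def faithfulness(tokens: Sequence[str], mask: Sequence[bool],
                 predict: Predict) -> Dict[str, Optional[float]]:
    """Raw and null-normalized sufficiency / comprehensiveness for one rationale."""
    p_full = predict(tokens)
    p_rationale = predict([t for t, m in zip(tokens, mask) if m])
    p_remainder = predict([t for t, m in zip(tokens, mask) if not m])
    p_null = predict([])  # empty input as the "no evidence" baseline

    suff = 1.0 - max(0.0, p_full - p_rationale)   # 1.0 means the rationale alone suffices
    comp = max(0.0, p_full - p_remainder)         # larger means removal hurts more
    null_diff = max(0.0, p_full - p_null)         # largest drop any rationale could explain

    if null_diff == 0.0:
        # Normalization is undefined when the full input adds nothing over empty input.
        return {"suff": suff, "comp": comp, "norm_suff": None, "norm_comp": None}

    suff_null = 1.0 - null_diff                   # sufficiency of the empty rationale
    clip = lambda v: min(1.0, max(0.0, v))
    return {
        "suff": suff,
        "comp": comp,
        "norm_suff": clip((suff - suff_null) / null_diff),
        "norm_comp": clip(comp / null_diff),
    }


# Toy stand-in for a classifier (assumption): cue words raise P(violation).
CUE_WEIGHTS = {"detained": 0.3, "lawyer": 0.1}


def toy_predict(tokens: Sequence[str]) -> float:
    return min(1.0, 0.5 + sum(CUE_WEIGHTS.get(t, 0.0) for t in tokens))


tokens = ["applicant", "detained", "without", "lawyer"]
mask = [False, True, False, False]  # the rationale is the single token "detained"
scores = faithfulness(tokens, mask, toy_predict)
print(scores)
```

In this toy case both normalized scores come out at about 0.75: the rationale recovers three quarters of the confidence gap between the empty input and the full input. A high score of this kind says nothing about whether a lawyer would pick the same fragment, which is the distinction the core claim turns on.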

What carries the argument

Rationale extraction techniques that select concise, human-interpretable text fragments from case documents to explain model outcome predictions, evaluated through faithfulness metrics and expert plausibility judgments.
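
As a rough illustration of what a model-agnostic rationale extractor does, the sketch below ranks sentences by leave-one-out occlusion: each sentence is removed in turn, and the sentences whose removal lowers the model's confidence the most are returned as the rationale. Occlusion is a generic technique chosen here for illustration only; this page does not name the two techniques the paper evaluates. The `toy_predict` function and the example sentences are invented.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: sentences -> P(predicted label | sentences).
Predict = Callable[[Sequence[str]], float]


def occlusion_rationale(sentences: Sequence[str], predict: Predict,
                        k: int = 1) -> List[int]:
    """Return indices of the k sentences whose removal hurts the prediction most."""
    p_full = predict(sentences)
    drops = []
    for i in range(len(sentences)):
        reduced = [s for j, s in enumerate(sentences) if j != i]
        drops.append(p_full - predict(reduced))  # confidence lost without sentence i
    # Stable sort: ties keep their original document order.
    ranked = sorted(range(len(sentences)), key=lambda i: drops[i], reverse=True)
    return sorted(ranked[:k])  # report the selection in document order


# Toy stand-in for a classifier (assumption): one cue word drives the score.
def toy_predict(sentences: Sequence[str]) -> float:
    return 0.9 if any("detained" in s for s in sentences) else 0.5


facts = [
    "The applicant was born in 1970.",
    "He was detained for two years without judicial review.",
    "He lodged his application in 2015.",
]
rationale = occlusion_rationale(facts, toy_predict, k=1)
print([facts[i] for i in rationale])
```

The extractor only needs black-box access to the model's output probability, which is what makes it model-agnostic. It also shows why such rationales are faithful by construction to the model, not to legal reasoning: the selection criterion is the model's own sensitivity.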

If this is right

  • Faithfulness metrics alone are insufficient to validate explanations in legal prediction tasks.
  • Plausibility must be assessed separately using domain expert judgments for legal applications.
  • LLM-as-a-Judge shows promise but requires expert references to calibrate its reliability.
  • The curated ECtHR dataset supports systematic comparison of interpretability methods on both violation and non-violation cases.
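
One way to read the LLM-as-a-Judge point above is as a calibration check: treat the expert label for each rationale as the reference and measure how far the judge agrees with it beyond chance. The sketch below uses Cohen's kappa for that purpose. The agreement measure the paper actually reports is not stated on this page, and the labels here are invented for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(reference: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(reference) != len(judge) or not reference:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(reference)
    observed = sum(r == j for r, j in zip(reference, judge)) / n
    ref_counts, judge_counts = Counter(reference), Counter(judge)
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = sum(ref_counts[c] * judge_counts[c] for c in ref_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels: expert reference vs. LLM judge on six extracted rationales.
expert_labels = ["plausible", "plausible", "implausible",
                 "implausible", "plausible", "implausible"]
judge_labels = ["plausible", "implausible", "implausible",
                "implausible", "plausible", "plausible"]

kappa = cohens_kappa(expert_labels, judge_labels)
print(f"judge vs. expert kappa: {kappa:.3f}")
```

Raw agreement in this toy example is 4 out of 6, but half of that is expected by chance given the label base rates, so kappa is about one third. Reporting a chance-corrected figure avoids overstating how well the judge tracks the experts.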

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The expert disagreement suggests that current extraction methods may need to incorporate legal domain knowledge to generate more plausible rationales.
  • In practice, deploying such models for legal advice would require additional layers of human review to ensure explanations align with professional standards.
  • Testing these techniques on other legal datasets could reveal whether the expert disagreement is specific to ECtHR or general across judicial domains.

Load-bearing premise

Legal experts' assessments of rationale plausibility serve as an accurate and consistent ground truth for evaluating explanation quality in ECtHR cases.

What would settle it

If a new rationale extraction method produced fragments that legal experts consistently rate as matching their own reasoning, while preserving high faithfulness scores on the same dataset, that would be evidence that model explanations can be aligned with expert views.

Original abstract

Interpretability is critical for applications of large language models (LLMs) in the legal domain, where trust and transparency are essential. A central NLP task in this setting is legal outcome prediction, where models forecast whether a court will find a violation of a given right. We study this task on decisions from the European Court of Human Rights (ECtHR), introducing a new ECtHR dataset with carefully curated positive (violation) and negative (non-violation) cases. Existing works propose both task-specific approaches and model-agnostic techniques to explain downstream performance, but it remains unclear which techniques best explain legal outcome prediction. To address this, we propose a comparative analysis framework for model-agnostic interpretability methods. We focus on two rationale extraction techniques that justify model outputs with concise, human-interpretable text fragments from the input. We evaluate faithfulness via normalized sufficiency and comprehensiveness metrics, and plausibility via legal expert judgments of the extracted rationales. We also assess the feasibility of using LLM-as-a-Judge, using these expert evaluations as reference. Our experiments on the new ECtHR dataset show that models' "reasons" for predicting violations differ substantially from those of legal experts, despite strong faithfulness scores. The source code of our experiments is publicly available at https://github.com/trusthlt/IntEval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a new curated ECtHR dataset of positive (violation) and negative (non-violation) cases for legal outcome prediction. It proposes a comparative framework for two model-agnostic rationale extraction techniques, evaluating them on faithfulness via normalized sufficiency and comprehensiveness metrics and on plausibility via legal expert judgments. It further tests LLM-as-a-Judge against the expert reference. The central finding is that model-extracted rationales differ substantially from expert reasoning despite strong faithfulness scores. Source code is released publicly.

Significance. If the results hold, the work demonstrates that faithfulness metrics alone are insufficient to validate explanations in legal NLP, revealing a gap between automated rationales and domain-expert reasoning in a high-stakes setting. This has implications for trust and transparency in legal AI applications. The public code release is a clear strength supporting reproducibility and follow-up studies.

major comments (2)
  1. [Abstract / Experiments] Abstract and experiments section: The claim that extracted rationales 'differ substantially' from legal experts rests on expert judgments as ground truth for plausibility. No details are given on the number of experts, selection criteria, inter-annotator agreement (e.g., Fleiss' kappa or pairwise rates), or the precise rating protocol used to assess divergence. Without these metrics, the mismatch could reflect annotation variance rather than model-expert misalignment.
  2. [Dataset] Dataset section: The manuscript describes a 'carefully curated' set of positive and negative ECtHR cases but provides no explicit criteria for case selection, filtering, or balancing. This is load-bearing for the generalizability of the reported divergence, as selection bias could systematically affect the apparent differences between model and expert rationales.
minor comments (1)
  1. [Abstract] The GitHub link is a positive feature for reproducibility, but the abstract could name the two specific rationale extraction techniques evaluated.
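
The inter-annotator agreement requested in major comment 1 could be reported with Fleiss' kappa, which handles more than two raters. The sketch below is a plain-Python version of the standard formula. The rating table is invented, and the number of experts and rating categories in the paper are not given on this page, so both are assumptions.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = number of raters who put item i in category j."""
    if not counts:
        raise ValueError("need at least one rated item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_items = len(counts)
    n_categories = len(counts[0])

    # Per-item agreement: share of rater pairs that chose the same category.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the overall share of ratings in each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(s * s for s in shares)
    if p_e == 1.0:
        return float("nan")  # all ratings fall into one category
    return (p_bar - p_e) / (1.0 - p_e)


# Invented data: 3 experts rate 4 rationales as (plausible, implausible).
ratings = [
    [3, 0],
    [2, 1],
    [0, 3],
    [1, 2],
]
print(f"Fleiss' kappa: {fleiss_kappa(ratings):.3f}")
```

A low value on the real annotations would support the referee's concern that the reported mismatch partly reflects disagreement among experts, while a high value would strengthen the reading that the divergence is between model and experts.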

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below. We agree that additional details are needed for clarity and will revise the manuscript accordingly.

  1. Referee: [Abstract / Experiments] Abstract and experiments section: The claim that extracted rationales 'differ substantially' from legal experts rests on expert judgments as ground truth for plausibility. No details are given on the number of experts, selection criteria, inter-annotator agreement (e.g., Fleiss' kappa or pairwise rates), or the precise rating protocol used to assess divergence. Without these metrics, the mismatch could reflect annotation variance rather than model-expert misalignment.

    Authors: We agree that these details are essential to substantiate the plausibility evaluation and rule out annotation variance as an explanation for the observed divergence. The current manuscript does not report them, which we will correct in the revision by adding a new subsection in the Experiments section describing the expert annotation process, including the number of experts, selection criteria, inter-annotator agreement, and rating protocol.
    revision: yes

  2. Referee: [Dataset] Dataset section: The manuscript describes a 'carefully curated' set of positive and negative ECtHR cases but provides no explicit criteria for case selection, filtering, or balancing. This is load-bearing for the generalizability of the reported divergence, as selection bias could systematically affect the apparent differences between model and expert rationales.

    Authors: We agree that explicit criteria are required to support reproducibility and to address potential selection bias concerns. The manuscript currently lacks this level of detail. In the revised version, we will expand the Dataset section with a full description of the curation process, including selection criteria, filtering steps, and balancing approach for positive and negative cases.
    revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on new data with external expert judgments


The paper introduces a new ECtHR dataset and performs an empirical comparison of rationale extraction methods using standard faithfulness metrics (normalized sufficiency and comprehensiveness) plus plausibility ratings from legal experts. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Central claims rest on experimental outcomes and independent expert assessments rather than any reduction to inputs by construction, self-citation chains, or renamed known results. The work is self-contained against external benchmarks with no load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Relies on standard assumptions from interpretability literature that sufficiency/comprehensiveness metrics capture faithfulness and that expert annotations provide valid plausibility labels.

axioms (2)
  • [domain assumption] Normalized sufficiency and comprehensiveness metrics are appropriate measures of rationale faithfulness.
    Invoked when evaluating the two rationale extraction techniques.
  • [domain assumption] Legal expert judgments serve as a reliable reference for plausibility.
    Central to the human evaluation component.

pith-pipeline@v0.9.0 · 5548 in / 1231 out tokens · 58089 ms · 2026-05-16T13:07:25.597495+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.