
arxiv: 2606.03728 · v1 · pith:IIE5OL27new · submitted 2026-06-02 · 💻 cs.CL · cs.IR

Re-Ranking Through an Attribution Lens for Citation Quality in Legal QA

Pith reviewed 2026-06-28 10:32 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords legal question answering · citation quality · re-ranking · attribution methods · retrieval augmented generation · cross-encoder · perturbation-based attribution

The pith

Re-ranking passages with attribution scores improves citation faithfulness in legal QA systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in retrieval-augmented legal question answering, semantic similarity between query and passage does not predict which passages the model will actually cite in its answer. On the AQuAECHR benchmark, random selection from candidates outperforms similarity-based ranking for finding gold citations. Instead, the authors train a cross-encoder re-ranker using continuous scores from perturbation-based attribution methods like C-LIME. This re-ranker boosts how faithfully the generated answers cite relevant passages and better matches expert gold answers. The approach also reveals that re-rankers trained on different language models converge on a common relevance signal.

Core claim

On the AQuAECHR benchmark, semantic similarity ranking performs worse than random at surfacing gold citation paragraphs for legal QA. Training a lightweight cross-encoder on perturbation-based attribution scores to re-rank passages prior to generation substantially improves citation faithfulness and alignment with gold expert answers across two language models and five-fold cross-validation. Re-rankers trained independently on different models converge beyond their raw attribution agreement, indicating the cross-encoder reduces model-specific noise and produces a shared relevance signal.
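The comparison against random selection can be made concrete with a small sketch. It assumes recall@k over gold citation paragraphs as the metric and assumes every gold paragraph is inside the candidate pool; the paper's exact metric is not stated in the abstract. All passage ids and the helper names `recall_at_k` and `expected_random_recall_at_k` are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Fraction of gold passages that appear among the first k ranked passages."""
    if not gold_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & gold_ids) / len(gold_ids)


def expected_random_recall_at_k(pool_size: int, k: int) -> float:
    """Expected recall@k for a uniform random draw of k passages from the pool.

    Assumes every gold passage is in the pool, so each one is selected
    with probability k / pool_size.
    """
    return min(k, pool_size) / pool_size


# Hypothetical pool of 10 candidates, already ordered by similarity.
# The gold passages sit low in that ordering (illustrative data only).
similarity_ranking = [f"p{i}" for i in range(10)]
gold = {"p7", "p8"}

sim_recall = recall_at_k(similarity_ranking, gold, k=3)
rand_recall = expected_random_recall_at_k(pool_size=10, k=3)
print(f"similarity recall@3 = {sim_recall:.2f}, random expectation = {rand_recall:.2f}")
```

In this toy pool the similarity ordering places no gold passage in the top 3, so it falls below the random expectation. That is the shape of the effect the paper reports, not a reproduction of its numbers.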

What carries the argument

A lightweight cross-encoder re-ranker trained on continuous perturbation-based attribution scores, such as those produced by C-LIME.
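To illustrate what a perturbation-based attribution score looks like as a training target, the sketch below uses leave-one-out ablation: each passage is removed in turn and the drop in a scoring function is recorded. This is a simpler relative of C-LIME, not C-LIME itself, and `toy_score` is a hypothetical stand-in for a model-based score such as the likelihood of the generated answer.

```python
from typing import Callable, List, Sequence, Tuple


def leave_one_out_attribution(
    passages: Sequence[str],
    score_fn: Callable[[Sequence[str]], float],
) -> List[float]:
    """Attribution of passage i = score(all passages) - score(all but i)."""
    full_score = score_fn(passages)
    attributions = []
    for i in range(len(passages)):
        reduced = [p for j, p in enumerate(passages) if j != i]
        attributions.append(full_score - score_fn(reduced))
    return attributions


def toy_score(context: Sequence[str]) -> float:
    """Hypothetical scorer: counts passages mentioning 'disclosure'.

    A real pipeline would query the language model here.
    """
    return sum(1.0 for p in context if "disclosure" in p)


passages = [
    "Art. 10 media criteria",
    "Art. 6 disclosure of evidence",
    "procedural history",
]
attributions = leave_one_out_attribution(passages, toy_score)

# Continuous targets for a pointwise re-ranker: (query, passage, score) rows.
query = "hypothetical question about disclosure obligations"
training_rows: List[Tuple[str, str, float]] = [
    (query, p, a) for p, a in zip(passages, attributions)
]
for row in training_rows:
    print(row)
```

The resulting rows are the kind of continuous supervision a cross-encoder could regress on; the loss and architecture the authors use are not given in the abstract.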

If this is right

  • The re-ranker substantially improves citation faithfulness.
  • Alignment with gold expert answers increases.
  • Re-rankers from different models produce a shared relevance signal that partially transfers across models.
  • Same-model re-ranking remains more effective than cross-model.
  • Perturbation-based attribution provides a practical training signal for citation-aware retrieval.
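One way to quantify whether two re-rankers share a relevance signal is a rank correlation between their scores over the same passages, compared with the same correlation between the raw attribution scores. The sketch below assumes Spearman correlation as the agreement measure, which the abstract does not specify; the four score lists are invented for illustration and are not results from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores for four passages (illustration only, not paper data).
raw_attr_model_a = [0.9, 0.1, 0.4, 0.2]
raw_attr_model_b = [0.2, 0.8, 0.5, 0.1]
reranker_a = [0.7, 0.3, 0.6, 0.1]
reranker_b = [0.6, 0.4, 0.8, 0.2]

raw_agreement = spearman(raw_attr_model_a, raw_attr_model_b)
reranker_agreement = spearman(reranker_a, reranker_b)
print(f"raw: {raw_agreement:.2f}, re-rankers: {reranker_agreement:.2f}")
```

Convergence in the paper's sense would show up as the re-ranker agreement exceeding the raw attribution agreement on held-out passages.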

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attribution-based re-ranking could be tested in non-legal domains where citation accuracy matters.
  • The convergence suggests potential for model-agnostic citation retrieval systems.
  • Future work might explore whether this signal improves overall answer quality beyond citations.

Load-bearing premise

Perturbation-based attribution scores, such as those from C-LIME, are a reliable and generalizable training signal for citation quality, and they rank passages within the candidate pool better than semantic similarity does.

What would settle it

A failure of the trained cross-encoder to improve citation faithfulness metrics compared to the baseline retriever on the AQuAECHR benchmark under five-fold cross-validation would falsify the claim.
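The falsification test above amounts to a paired comparison across folds. The following sketch assumes folds are split by question id and that a single faithfulness metric is averaged per fold; the function names, the number of questions, and the per-fold values are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def k_fold_splits(
    items: Sequence[str], k: int = 5, seed: int = 0
) -> List[Tuple[List[str], List[str]]]:
    """Shuffle items once, then return k (train, test) pairs with disjoint test folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        splits.append((train, test))
    return splits


def mean_fold_delta(baseline: Sequence[float], reranked: Sequence[float]) -> float:
    """Mean per-fold improvement of the re-ranker over the baseline."""
    if len(baseline) != len(reranked):
        raise ValueError("need one metric value per fold for each system")
    return sum(r - b for b, r in zip(baseline, reranked)) / len(baseline)


# Hypothetical question ids; splitting by question keeps each question's
# passages out of both train and test at the same time.
question_ids = [f"q{i}" for i in range(23)]
splits = k_fold_splits(question_ids, k=5)

# Invented per-fold faithfulness values, for illustration only.
baseline_scores = [0.40, 0.42, 0.38, 0.41, 0.39]
reranked_scores = [0.45, 0.44, 0.43, 0.40, 0.46]
delta = mean_fold_delta(baseline_scores, reranked_scores)
print(f"mean improvement across folds: {delta:.3f}")
```

A mean delta at or below zero across folds would be the failure case described above; a fuller analysis would also report per-fold variance or a paired significance test.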

Figures

Figures reproduced from arXiv: 2606.03728 by Mohamed Hesham Elganayni, Selim Saleh.

Figure 1: C-LIME re-ranking pipeline. Top: baseline retrieval surfaces passages dominated by Art. 10 media criteria (red). Middle: perturbation-based attribution trains a cross-encoder offline. Bottom: re-ranking promotes Art. 6 disclosure passages (green) to top ranks. Gold-cited passages are marked with both yellow shading and stars (⋆). Rank shifts shown at right.
Abstract

Retrieval-augmented generation systems for legal question answering typically retrieve passages based on semantic similarity and provide them to a language model, which then generates cited answers. Prior work assumes that highly ranked passages are most likely to be usefully cited by the model. Perturbation-based attribution methods, such as C-LIME, have been used exclusively for post-hoc explanation. However, on the AQuAECHR benchmark, semantic similarity does not correlate with passage attribution. Within a retriever's candidate pool, similarity-based ranking performs worse than random selection at surfacing gold citation paragraphs. To address this limitation, a lightweight cross-encoder is trained on continuous perturbation-based attribution scores to re-rank passages prior to generation. This approach is evaluated on the AQuAECHR benchmark, using two language models and five-fold cross-validation. The re-ranker substantially improves citation faithfulness and alignment with gold expert answers. Notably, two re-rankers trained independently on different models converge beyond their raw attribution agreement. This finding indicates that the cross-encoder reduces model-specific noise and produces a shared relevance signal that partially transfers across models, although same-model re-ranking remains more effective. These results demonstrate that perturbation-based attribution provides a practical, model-agnostic training signal for citation-aware retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that in retrieval-augmented legal QA, semantic similarity ranking within a retriever's candidate pool correlates poorly with perturbation-based attribution (C-LIME) and underperforms random selection at surfacing gold citation paragraphs on the AQuAECHR benchmark. A lightweight cross-encoder is trained on continuous attribution scores to re-rank passages before generation; evaluated with two language models and five-fold cross-validation, this re-ranker improves citation faithfulness and alignment with gold expert answers. Two independently trained re-rankers converge beyond their raw attribution agreement, indicating that the cross-encoder reduces model-specific noise and yields a partially transferable shared relevance signal (though same-model re-ranking remains stronger).

Significance. If the empirical results hold, the work supplies a practical, model-agnostic training signal for citation-aware retrieval that leverages existing attribution methods rather than requiring gold citation labels. The observation of cross-model convergence after independent training on different LMs is a notable strength, as is the use of five-fold cross-validation; both provide evidence that the approach can distill a general relevance signal. These elements address a documented mismatch between semantic similarity and model attribution in a high-stakes domain.

major comments (2)
  1. [Abstract] Abstract: the central claim that attribution-trained re-ranking 'substantially improves citation faithfulness and alignment with gold expert answers' is load-bearing yet unsupported by any reported quantitative deltas, confidence intervals, or per-metric values; without these numbers the magnitude of improvement cannot be assessed relative to the semantic-similarity and random baselines already mentioned.
  2. [Abstract] Abstract / Evaluation description: the premise that continuous C-LIME attribution scores constitute a superior training signal is invoked to justify training the cross-encoder, but no ablation is described that compares this signal against an identically architected cross-encoder trained on gold citation labels (or on model log-probabilities); such a comparison is required to rule out the possibility that any cross-encoder simply learns a generic relevance function.
minor comments (1)
  1. [Abstract] Abstract: the benchmark is referred to only as 'AQuAECHR'; an expansion or citation on first use would aid readers unfamiliar with the dataset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and outline the revisions we will make to strengthen the paper.

  1. Referee: [Abstract] Abstract: the central claim that attribution-trained re-ranking 'substantially improves citation faithfulness and alignment with gold expert answers' is load-bearing yet unsupported by any reported quantitative deltas, confidence intervals, or per-metric values; without these numbers the magnitude of improvement cannot be assessed relative to the semantic-similarity and random baselines already mentioned.

    Authors: We agree with this observation. The abstract currently summarizes the improvements qualitatively. In the revised version, we will include specific quantitative deltas, confidence intervals, and per-metric values from our five-fold cross-validation experiments to allow readers to assess the magnitude of improvement over the semantic similarity and random baselines.
    revision: yes

  2. Referee: [Abstract] Abstract / Evaluation description: the premise that continuous C-LIME attribution scores constitute a superior training signal is invoked to justify training the cross-encoder, but no ablation is described that compares this signal against an identically architected cross-encoder trained on gold citation labels (or on model log-probabilities); such a comparison is required to rule out the possibility that any cross-encoder simply learns a generic relevance function.

    Authors: This is a valid point. Our work focuses on using attribution scores as a training signal in the absence of gold labels, which is a key practical advantage in legal QA. Nevertheless, to address the concern and strengthen the evaluation, we will perform and report an ablation study in the revised manuscript. This will involve training an identical cross-encoder on gold citation labels from the AQuAECHR benchmark and comparing its performance to the attribution-based re-ranker. We will also consider including a comparison to log-probability based training if feasible. The cross-model convergence results already suggest the signal is not entirely generic, but the proposed ablation will provide direct evidence.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external signals and benchmark


The paper trains a cross-encoder re-ranker using continuous C-LIME attribution scores (external perturbation method) as the training target and evaluates citation faithfulness plus expert-answer alignment on the independent AQuAECHR benchmark with five-fold cross-validation. The reported cross-model convergence is an empirical observation on held-out data, not a quantity that reduces by the paper's own equations or definitions to a fitted parameter or self-citation. No self-definitional steps, fitted-input-called-prediction, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the abstract or described method. The central claims remain falsifiable against external gold labels and do not collapse to the inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review performed on abstract only; full methods, results tables, and any hyperparameter details are unavailable, limiting the ability to enumerate all free parameters or domain assumptions.

free parameters (1)
  • cross-encoder training hyperparameters
    Specific learning rate, batch size, and architecture choices for the re-ranker are not stated in the abstract yet determine the re-ranking signal.
axioms (1)
  • domain assumption: Perturbation-based attribution scores provide a better proxy for citation utility than semantic similarity within a fixed candidate pool
    This premise is required for the decision to train the cross-encoder on attribution scores rather than similarity.

pith-pipeline@v0.9.1-grok · 5752 in / 1339 out tokens · 24815 ms · 2026-06-28T10:32:54.070626+00:00 · methodology

