Re-Ranking Through an Attribution Lens for Citation Quality in Legal QA

Mohamed Hesham Elganayni; Selim Saleh

arxiv: 2606.03728 · v1 · pith:IIE5OL27new · submitted 2026-06-02 · 💻 cs.CL · cs.IR

Pith reviewed 2026-06-28 10:32 UTC · model grok-4.3

keywords legal question answering · citation quality · re-ranking · attribution methods · retrieval augmented generation · cross-encoder · perturbation-based attribution

The pith

Re-ranking passages with attribution scores improves citation faithfulness in legal QA systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in retrieval-augmented legal question answering, semantic similarity between query and passage does not correlate with passage attribution, that is, with which passages actually drive the model's cited answer. On the AQuAECHR benchmark, similarity-based ranking within the retriever's candidate pool does worse than random selection at surfacing gold citation paragraphs. The authors instead train a lightweight cross-encoder re-ranker on continuous scores from perturbation-based attribution methods such as C-LIME. The re-ranker improves citation faithfulness and alignment with gold expert answers. The paper also reports that re-rankers trained on different language models converge on a shared relevance signal.

Core claim

On the AQuAECHR benchmark, semantic similarity ranking performs worse than random at surfacing gold citation paragraphs for legal QA. Training a lightweight cross-encoder on perturbation-based attribution scores to re-rank passages prior to generation substantially improves citation faithfulness and alignment with gold expert answers across two language models and five-fold cross-validation. Re-rankers trained independently on different models converge beyond their raw attribution agreement, indicating the cross-encoder reduces model-specific noise and produces a shared relevance signal.
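The "worse than random" comparison can be made concrete with a recall-at-k check. What follows is a minimal sketch, not the paper's evaluation code: the function names, the toy passage IDs, the similarity values, and the choice of recall@k as the metric are all assumptions made for illustration.

```python
import random

def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of gold passages that appear in the top-k of a ranking."""
    if not gold_ids:
        return 0.0
    top = set(ranked_ids[:k])
    return len(top & set(gold_ids)) / len(gold_ids)

def random_baseline_recall(candidate_ids, gold_ids, k, trials=1000, seed=0):
    """Average recall@k over random orderings of the same candidate pool."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        shuffled = list(candidate_ids)
        rng.shuffle(shuffled)
        total += recall_at_k(shuffled, gold_ids, k)
    return total / trials

# Toy candidate pool (hypothetical IDs and similarity scores, not benchmark data).
candidates = [f"p{i}" for i in range(1, 11)]
similarity = {p: 1.0 - 0.1 * i for i, p in enumerate(candidates)}  # p1 is most similar
gold = ["p8", "p9"]  # gold citations sit low in the similarity order

by_similarity = sorted(candidates, key=lambda p: similarity[p], reverse=True)
sim_recall = recall_at_k(by_similarity, gold, k=3)
rand_recall = random_baseline_recall(candidates, gold, k=3)
print(f"similarity recall@3 = {sim_recall:.2f}, random recall@3 = {rand_recall:.2f}")
```

In this toy pool the gold passages are deliberately placed low in the similarity order, so the similarity ranking misses them while random selection finds some by chance. Whether real data behaves this way is exactly what the paper measures on AQuAECHR.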

What carries the argument

A lightweight cross-encoder re-ranker trained on continuous perturbation-based attribution scores such as C-LIME.
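To show what a perturbation-based attribution score is, here is a simplified sketch. It is not C-LIME: instead of fitting a local linear surrogate, it uses a plainer mean-difference estimator (average model score with a passage present minus average score with it absent, over random subsets). The `score_fn` callable is a hypothetical stand-in for whatever quantity the generator exposes, and the toy weights are invented.

```python
import random

def perturbation_attribution(passages, score_fn, n_samples=200, seed=0):
    """Estimate each passage's contribution by randomly dropping passages.

    score_fn(kept_passages) -> float is an assumed black-box score, e.g. how
    strongly the generator supports a reference answer given that context.
    """
    rng = random.Random(seed)
    n = len(passages)
    sum_in, cnt_in = [0.0] * n, [0] * n
    sum_out, cnt_out = [0.0] * n, [0] * n
    for _ in range(n_samples):
        mask = [rng.random() < 0.5 for _ in range(n)]  # keep each passage with p=0.5
        kept = [p for p, keep in zip(passages, mask) if keep]
        score = score_fn(kept)
        for i, keep in enumerate(mask):
            if keep:
                sum_in[i] += score
                cnt_in[i] += 1
            else:
                sum_out[i] += score
                cnt_out[i] += 1
    attributions = []
    for i in range(n):
        mean_in = sum_in[i] / cnt_in[i] if cnt_in[i] else 0.0
        mean_out = sum_out[i] / cnt_out[i] if cnt_out[i] else 0.0
        attributions.append(mean_in - mean_out)  # continuous score, not a 0/1 label
    return attributions

# Toy black box: the score is driven mostly by passage "B" (invented weights).
toy_weights = {"A": 0.0, "B": 1.0, "C": 0.2}

def toy_score(kept):
    return sum(toy_weights[p] for p in kept)

attr = perturbation_attribution(["A", "B", "C"], toy_score)
print(dict(zip(["A", "B", "C"], (round(a, 2) for a in attr))))
```

The output is a continuous score per passage, which is the kind of target the paper describes feeding to the re-ranker in place of similarity.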

If this is right

The re-ranker substantially improves citation faithfulness.
Alignment with gold expert answers increases.
Re-rankers from different models produce a shared relevance signal that partially transfers across models.
Same-model re-ranking remains more effective than cross-model.
Perturbation-based attribution provides a practical training signal for citation-aware retrieval.
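The convergence claim in the list above compares two agreement figures: how much the raw attribution scores of two models agree, and how much the two re-rankers trained on them agree. One way to compute such an agreement is a rank correlation. The sketch below uses Spearman's rho as an assumed choice of measure (the page does not say which one the paper uses), and every number in it is a made-up placeholder chosen only to exercise the function.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = avg
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Placeholder scores for five passages of one question (not results from the paper).
raw_attr_model_a = [0.9, 0.1, 0.5, 0.3, 0.7]
raw_attr_model_b = [0.2, 0.4, 0.6, 0.1, 0.8]
reranker_a = [0.8, 0.2, 0.6, 0.3, 0.7]
reranker_b = [0.6, 0.3, 0.5, 0.2, 0.9]

raw_agreement = spearman(raw_attr_model_a, raw_attr_model_b)
reranker_agreement = spearman(reranker_a, reranker_b)
print(f"raw: {raw_agreement:.2f}, re-rankers: {reranker_agreement:.2f}")
```

"Converging beyond raw attribution agreement" would correspond to the second figure exceeding the first when averaged over held-out questions.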

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Attribution-based re-ranking could be tested in non-legal domains where citation accuracy matters.
The convergence suggests potential for model-agnostic citation retrieval systems.
Future work might explore whether this signal improves overall answer quality beyond citations.

Load-bearing premise

Perturbation-based attribution scores such as C-LIME are a reliable, generalizable training signal for citation quality, and a better one than semantic similarity within the candidate pool.

What would settle it

A failure of the trained cross-encoder to improve citation faithfulness metrics compared to the baseline retriever on the AQuAECHR benchmark under five-fold cross-validation would falsify the claim.
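The five-fold protocol behind this test can be sketched as a split over questions. This is an illustrative sketch only: splitting at the question level (so that passages from one question never appear in both train and test) is an assumption, as are the function name and the toy question IDs.

```python
import random

def k_fold_splits(question_ids, k=5, seed=0):
    """Yield (train_ids, test_ids) pairs; each question is held out exactly once."""
    ids = list(question_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]  # round-robin assignment to k folds
    for i in range(k):
        test = folds[i]
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        yield train, test

# Toy question IDs (hypothetical).
questions = [f"q{i}" for i in range(12)]
splits = list(k_fold_splits(questions, k=5))
for fold_index, (train, test) in enumerate(splits):
    # In a real run: train the re-ranker on `train`, then compare citation
    # faithfulness of baseline vs re-ranked contexts on `test`.
    print(fold_index, len(train), len(test))
```

The claim would be in trouble if the re-ranked contexts failed to beat the baseline on the held-out questions across these folds.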

Figures

Figures reproduced from arXiv: 2606.03728 by Mohamed Hesham Elganayni, Selim Saleh.

**Figure 1.** C-LIME re-ranking pipeline. Top: baseline retrieval surfaces passages dominated by Art. 10 media criteria (red). Middle: perturbation-based attribution trains a cross-encoder offline. Bottom: re-ranking promotes Art. 6 disclosure passages (green) to top ranks. Gold-cited passages are marked with both yellow shading and stars (⋆). Rank shifts shown at right.

Original abstract

Retrieval-augmented generation systems for legal question answering typically retrieve passages based on semantic similarity and provide them to a language model, which then generates cited answers. Prior work assumes that highly ranked passages are most likely to be usefully cited by the model. Perturbation-based attribution methods, such as C-LIME, have been used exclusively for post-hoc explanation. However, on the AQuAECHR benchmark, semantic similarity does not correlate with passage attribution. Within a retriever's candidate pool, similarity-based ranking performs worse than random selection at surfacing gold citation paragraphs. To address this limitation, a lightweight cross-encoder is trained on continuous perturbation-based attribution scores to re-rank passages prior to generation. This approach is evaluated on the AQuAECHR benchmark, using two language models and five-fold cross-validation. The re-ranker substantially improves citation faithfulness and alignment with gold expert answers. Notably, two re-rankers trained independently on different models converge beyond their raw attribution agreement. This finding indicates that the cross-encoder reduces model-specific noise and produces a shared relevance signal that partially transfers across models, although same-model re-ranking remains more effective. These results demonstrate that perturbation-based attribution provides a practical, model-agnostic training signal for citation-aware retrieval.
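The train-then-re-rank step described in the abstract can be sketched end to end on toy data. This is a sketch under heavy assumptions, not the authors' implementation: a two-feature linear scorer stands in for the cross-encoder so the example runs without a deep-learning dependency, the pointwise squared-error objective is an assumed choice, and the passages and attribution targets are invented.

```python
def featurize(query, passage):
    """Hypothetical pair features; a real cross-encoder would jointly encode the pair."""
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    overlap = len(q_tokens & p_tokens) / max(len(q_tokens), 1)
    return [1.0, overlap]  # bias term + lexical overlap

def score(weights, query, passage):
    return sum(w * x for w, x in zip(weights, featurize(query, passage)))

def squared_error(examples, weights):
    return sum((score(weights, q, p) - t) ** 2 for q, p, t in examples)

def train_pointwise(examples, epochs=500, lr=0.1):
    """Regress continuous attribution targets with plain SGD on squared error."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for query, passage, target in examples:
            x = featurize(query, passage)
            err = sum(w * xi for w, xi in zip(weights, x)) - target
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights

def rerank(query, candidates, weights, top_k):
    """Re-order the retriever's candidates and keep the top_k for generation."""
    ordered = sorted(candidates, key=lambda p: score(weights, query, p), reverse=True)
    return ordered[:top_k]

# Invented training triples: (query, passage, attribution score).
query = "disclosure of evidence in criminal proceedings"
examples = [
    (query, "freedom of the press and media criteria", 0.1),
    (query, "disclosure of evidence to the defence", 0.9),
    (query, "equality of arms in criminal proceedings", 0.7),
]

loss_before = squared_error(examples, [0.0, 0.0])
weights = train_pointwise(examples)
loss_after = squared_error(examples, weights)
top = rerank(query, [p for _, p, _ in examples], weights, top_k=2)
print(loss_before, loss_after, top)
```

The point of the sketch is the shape of the pipeline: continuous attribution scores are the regression targets, and the trained scorer selects which candidates reach the generator.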

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper shows attribution scores can train a re-ranker that beats semantic similarity for citation faithfulness in legal QA, but the lack of gold-label baselines leaves the specific value of the attribution signal unclear.

The core move here is training a cross-encoder re-ranker directly on continuous C-LIME attribution scores instead of using them only for post-hoc explanation. On the AQuAECHR benchmark the authors report that semantic similarity correlates poorly with attribution and underperforms random ranking for surfacing gold citations. The re-ranker then improves faithfulness and alignment with expert answers, and two independently trained re-rankers converge more than their raw attribution scores agree.

What stands out is the concrete demonstration that attribution can serve as a training signal for retrieval in a domain where citation accuracy matters. The cross-model convergence is a nice observation and suggests the re-ranker is picking up something more stable than model-specific noise.

The main gap is the missing comparison. The abstract does not show whether an identical cross-encoder trained on gold citation labels or on model log-probabilities would produce similar or larger gains. Without that ablation it is hard to know whether the attribution scores are doing unique work or whether any reasonable relevance signal would suffice once you move to a cross-encoder. The five-fold cross-validation across two models is mentioned but no deltas, confidence intervals, or ablation tables appear in the abstract, so the size of the improvement is still opaque.

The work is scoped to RAG systems for legal QA rather than a broad claim about attribution methods. Readers already working on citation-faithful retrieval or legal-domain RAG would find the setup and the negative result on semantic ranking useful to see. It is coherent on its own terms and engages the literature it cites, so it clears the bar for a serious referee even if the central claim needs tighter controls on the training signal.

Referee Report

2 major / 1 minor

Summary. The paper claims that in retrieval-augmented legal QA, semantic similarity ranking within a retriever's candidate pool correlates poorly with perturbation-based attribution (C-LIME) and underperforms random selection at surfacing gold citation paragraphs on the AQuAECHR benchmark. A lightweight cross-encoder is trained on continuous attribution scores to re-rank passages before generation; evaluated with two language models and five-fold cross-validation, this re-ranker improves citation faithfulness and alignment with gold expert answers. Two independently trained re-rankers converge beyond their raw attribution agreement, indicating that the cross-encoder reduces model-specific noise and yields a partially transferable shared relevance signal (though same-model re-ranking remains stronger).

Significance. If the empirical results hold, the work supplies a practical, model-agnostic training signal for citation-aware retrieval that leverages existing attribution methods rather than requiring gold citation labels. The observation of cross-model convergence after independent training on different LMs is a notable strength, as is the use of five-fold cross-validation; both provide evidence that the approach can distill a general relevance signal. These elements address a documented mismatch between semantic similarity and model attribution in a high-stakes domain.

major comments (2)

[Abstract] Abstract: the central claim that attribution-trained re-ranking 'substantially improves citation faithfulness and alignment with gold expert answers' is load-bearing yet unsupported by any reported quantitative deltas, confidence intervals, or per-metric values; without these numbers the magnitude of improvement cannot be assessed relative to the semantic-similarity and random baselines already mentioned.
[Abstract] Abstract / Evaluation description: the premise that continuous C-LIME attribution scores constitute a superior training signal is invoked to justify training the cross-encoder, but no ablation is described that compares this signal against an identically architected cross-encoder trained on gold citation labels (or on model log-probabilities); such a comparison is required to rule out the possibility that any cross-encoder simply learns a generic relevance function.
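The deltas and confidence intervals requested in the first major comment could be reported with a paired bootstrap over questions. The sketch below is one assumed way to do that, not something the paper describes; the per-question faithfulness scores are placeholders, not results.

```python
import random

def paired_bootstrap_ci(baseline, reranked, n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-question delta with a percentile bootstrap confidence interval."""
    rng = random.Random(seed)
    deltas = [r - b for b, r in zip(baseline, reranked)]
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]  # resample questions
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(deltas) / n, lo, hi

# Placeholder per-question faithfulness scores (made up for illustration).
baseline_scores = [0.40, 0.35, 0.50, 0.45, 0.30, 0.55]
reranked_scores = [0.55, 0.50, 0.60, 0.65, 0.45, 0.70]

mean_delta, lo, hi = paired_bootstrap_ci(baseline_scores, reranked_scores)
print(f"delta = {mean_delta:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero would support the word "substantially"; one that straddles zero would not.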

minor comments (1)

[Abstract] Abstract: the benchmark is referred to only as 'AQuAECHR'; an expansion or citation on first use would aid readers unfamiliar with the dataset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and outline the revisions we will make to strengthen the paper.

Referee: [Abstract] Abstract: the central claim that attribution-trained re-ranking 'substantially improves citation faithfulness and alignment with gold expert answers' is load-bearing yet unsupported by any reported quantitative deltas, confidence intervals, or per-metric values; without these numbers the magnitude of improvement cannot be assessed relative to the semantic-similarity and random baselines already mentioned.

Authors: We agree with this observation. The abstract currently summarizes the improvements qualitatively. In the revised version, we will include specific quantitative deltas, confidence intervals, and per-metric values from our five-fold cross-validation experiments to allow readers to assess the magnitude of improvement over the semantic similarity and random baselines. revision: yes

Referee: [Abstract] Abstract / Evaluation description: the premise that continuous C-LIME attribution scores constitute a superior training signal is invoked to justify training the cross-encoder, but no ablation is described that compares this signal against an identically architected cross-encoder trained on gold citation labels (or on model log-probabilities); such a comparison is required to rule out the possibility that any cross-encoder simply learns a generic relevance function.

Authors: This is a valid point. Our work focuses on using attribution scores as a training signal in the absence of gold labels, which is a key practical advantage in legal QA. Nevertheless, to address the concern and strengthen the evaluation, we will perform and report an ablation study in the revised manuscript. This will involve training an identical cross-encoder on gold citation labels from the AQuAECHR benchmark and comparing its performance to the attribution-based re-ranker. We will also consider including a comparison to log-probability based training if feasible. The cross-model convergence results already suggest the signal is not entirely generic, but the proposed ablation will provide direct evidence. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external signals and benchmark

The paper trains a cross-encoder re-ranker using continuous C-LIME attribution scores (external perturbation method) as the training target and evaluates citation faithfulness plus expert-answer alignment on the independent AQuAECHR benchmark with five-fold cross-validation. The reported cross-model convergence is an empirical observation on held-out data, not a quantity that reduces by the paper's own equations or definitions to a fitted parameter or self-citation. No self-definitional steps, fitted-input-called-prediction, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the abstract or described method. The central claims remain falsifiable against external gold labels and do not collapse to the inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review performed on abstract only; full methods, results tables, and any hyperparameter details are unavailable, limiting the ability to enumerate all free parameters or domain assumptions.

free parameters (1)

cross-encoder training hyperparameters
Specific learning rate, batch size, and architecture choices for the re-ranker are not stated in the abstract yet determine the re-ranking signal.

axioms (1)

Domain assumption: perturbation-based attribution scores provide a better proxy for citation utility than semantic similarity within a fixed candidate pool.
This premise is required for the decision to train the cross-encoder on attribution scores rather than similarity.


