Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation

Shao Kan

arxiv: 2605.21949 · v1 · pith:3FN5QUX7new · submitted 2026-05-21 · 💻 cs.CL

Pith reviewed 2026-05-22 06:55 UTC · model grok-4.3

classification 💻 cs.CL

keywords medical RAG · claim verification · certification · intent-aware selector · mixed evidence · high-risk QA · retrieval-augmented generation · unsupported claim risk

The pith

Medical RAG systems can decompose responses into claims scored against evidence and mapped by an intent-aware selector to full, partial, conflict or abstain actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical RAG systems in high-risk settings usually make a single answer-or-abstain choice, yet real queries often involve evidence that supports one claim, qualifies another, and contradicts a third. The paper introduces claim-selective certification: responses are broken into verifiable claims, each claim is scored against retrieved sources, and an intent-aware selector routes the result to one of four actions (full, partial, conflict, abstain). On the real-source-only rows of the primary weak-label protocol, the system records zero unsupported-claim risk (UCCR) and near-perfect PAU and PAU Precision on both dev and test splits, with action accuracy of 0.9204 and 0.8997 respectively. The approach matters because it separates predicting which action to take from evidence-linked verification of each claim, giving finer safety control than whole-answer abstention. The reported rows cover the naturally occurring non-abstain actions; abstention under empty evidence is evaluated on a separate source-missing counterfactual slice.

Core claim

By decomposing each generated response into verifiable claims, scoring those claims against retrieved evidence, and applying an intent-aware selector that assigns one of four labels (full, partial, conflict, abstain), the system achieves UCCR of 0.0000, PAU of 1.0000 on dev and 0.9967 on test, PAU Precision of 0.9901 on dev and 0.9739 on test, and action accuracy of 0.9204 on dev and 0.8997 on test when evaluated on the real-source-only rows of the weak-label certificate protocol.

What carries the argument

The intent-aware selector that separates action-label prediction from evidence-linked claim selection under mixed evidence.
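To make the selector's role concrete, here is a minimal sketch of how scored claims might be routed to the four actions. It is not the paper's implementation: the score fields, the thresholds (0.7 and 0.9), the `STRICT_INTENTS` set, and the routing rules are all illustrative assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical: intents for which the sketch demands stronger support.
STRICT_INTENTS = {"dosage", "contraindication"}


@dataclass(frozen=True)
class ScoredClaim:
    text: str
    support: float        # assumed score in [0, 1]: evidence supports the claim
    contradiction: float  # assumed score in [0, 1]: evidence contradicts the claim


def select_action(claims: List[ScoredClaim], intent: str) -> Tuple[str, List[ScoredClaim]]:
    """Return (action, claims kept in the certified answer)."""
    tau_support = 0.9 if intent in STRICT_INTENTS else 0.7  # assumed thresholds
    tau_conflict = 0.7
    contradicted = [c for c in claims if c.contradiction >= tau_conflict]
    # Keep only claims that are supported and not contradicted.
    kept = [
        c for c in claims
        if c.support >= tau_support and c.contradiction < tau_conflict
    ]
    if contradicted:
        return "conflict", kept
    if not kept:
        return "abstain", []
    if len(kept) == len(claims):
        return "full", kept
    return "partial", kept
```

The point of the sketch is the two-part return value: the action label and the list of evidence-linked claims are produced separately, so one can be evaluated without the other.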

If this is right

UCCR remains zero within the certificate definition on both development and test sets covering non-abstain actions.
A source-missing counterfactual slice can be used to evaluate the abstain decision when evidence is empty.
Shortcut controls quantify how much of the action-label prior is explained by source and intent metadata alone.
Source-novel and evidence-novel slices characterize the boundaries of transfer performance.
The interface cleanly separates action prediction from evidence-linked claim selection.
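A shortcut control of the kind mentioned above can be sketched as a metadata-only majority baseline: predict the most frequent action for each (source, intent) pair and ignore the evidence entirely. The row schema (`source`, `intent`, `action`) is an assumption, and how the paper builds its own controls is not stated in the abstract.

```python
from collections import Counter, defaultdict


def fit_metadata_majority(train_rows):
    """Learn the majority action per (source, intent) pair, plus a global fallback."""
    by_key = defaultdict(Counter)
    overall = Counter()
    for row in train_rows:
        by_key[(row["source"], row["intent"])][row["action"]] += 1
        overall[row["action"]] += 1
    fallback = overall.most_common(1)[0][0]
    table = {key: counts.most_common(1)[0][0] for key, counts in by_key.items()}
    return table, fallback


def predict_action(table, fallback, row):
    """Predict from metadata only; unseen pairs get the global majority action."""
    return table.get((row["source"], row["intent"]), fallback)


def control_accuracy(table, fallback, rows):
    """Fraction of rows whose action label is recovered from metadata alone."""
    hits = sum(predict_action(table, fallback, r) == r["action"] for r in rows)
    return hits / len(rows)
```

If this control scores close to the full system's action accuracy, most of the action label is explained by the prior rather than by the evidence, which is the quantity the shortcut analysis is meant to expose.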

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same claim-decomposition and selector structure could be tested on legal or financial QA tasks where evidence is also mixed.
Replacing the current claim extractor with a lighter model might preserve the zero UCCR while lowering latency for real-time use.
Running the protocol on a new medical corpus with deliberately injected contradictions would test whether the reported transfer boundaries hold.
Modular replacement of the evidence scorer alone could isolate whether gains come mainly from claim granularity or from the selector.

Load-bearing premise

The weak-label certificate protocol and its real-source-only dev/test rows accurately represent naturally occurring mixed-evidence scenarios in high-risk medical QA.

What would settle it

A collection of medical questions containing mixed supporting, conditional, and contradicting evidence where the system issues a non-abstain action that includes at least one unsupported claim.
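The falsifying condition can be written as a simple audit over system outputs. The sketch below assumes each output row carries an `action` and a list of `emitted_claims` with an independently judged boolean `supported` flag; these field names are hypothetical, and the paper's certificate definition of "unsupported" may differ.

```python
def find_counterexamples(rows):
    """Return (row id, unsupported claim texts) for every non-abstain row
    that emits at least one claim judged unsupported."""
    hits = []
    for row in rows:
        if row["action"] == "abstain":
            continue  # abstaining rows emit no certified claims
        unsupported = [c["text"] for c in row["emitted_claims"] if not c["supported"]]
        if unsupported:
            hits.append((row["id"], unsupported))
    return hits
```

A single non-empty result on a fresh mixed-evidence collection would be enough to contradict a zero unsupported-claim rate on that collection.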

Figures

Figures reproduced from arXiv: 2605.21949 by Shao Kan.

**Figure 1.** System architecture. The pipeline combines template-based claim decomposition with explicit question intent, …

**Figure 2.** Shortcut and perturbation controls on the primary split. Metadata-only majority rows are action-only controls …

**Figure 3.** Baseline operating map on the primary split. Binary-form baselines reduce unsupported-claim risk by …

**Figure 4.** Source/evidence-novel boundary. The full selector keeps the certificate target as overlap constraints tighten, …

**Figure 5.** Extended source-level diagnostics for the full system. OpenFDA has the highest action accuracy, whereas …

**Figure 6.** Action confusion on the strongest and weakest sources. PubMedQA errors are dominated by …

**Figure 7.** Selective-prediction view on the primary split. The threshold-only selector traces a tunable risk–coverage …
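For readers unfamiliar with the selective-prediction view in Figure 7, a risk–coverage curve can be traced by sweeping a confidence threshold, as in the generic sketch below. This is the standard construction, not necessarily the paper's exact procedure; `confidences` and `correct` are assumed per-row inputs.

```python
def risk_coverage(confidences, correct):
    """Return (coverage, risk) points, answering rows from most to least confident.

    coverage = fraction of rows answered; risk = error rate among answered rows.
    Ties in confidence are treated as separate thresholds for simplicity.
    """
    ranked = sorted(zip(confidences, correct), key=lambda pair: -pair[0])
    total = len(ranked)
    points = []
    errors = 0
    for answered, (_, is_correct) in enumerate(ranked, start=1):
        if not is_correct:
            errors += 1
        points.append((answered / total, errors / answered))
    return points
```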

Original abstract

Medical RAG systems in high-risk QA settings are often evaluated through a single answer-or-abstain decision, but mixed evidence may support one claim, require conditions for another, and contradict a third. We study claim-selective certification: each response is decomposed into verifiable claims, scored against retrieved evidence, and mapped by an intent-aware selector to {full, partial, conflict, abstain}. On the primary weak-label certificate protocol, whose real-source-only dev/test rows cover the naturally occurring non-abstain actions, the full system records UCCR=0.0000, PAU=1.0000, PAU Precision=0.9901, and action accuracy=0.9204 on dev (n=314), and UCCR=0.0000, PAU=0.9967, PAU Precision=0.9739, and action accuracy=0.8997 on test (n=319). UCCR measures unsupported-claim risk within the certificate definition, and a source-missing counterfactual slice evaluates abstain under empty evidence. Shortcut controls quantify the action-label prior explained by source and intent metadata, while source/evidence-novel slices characterize transfer boundaries. The resulting interface separates action-label prediction from evidence-linked claim selection under mixed evidence.
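As a quick sanity check on scale, the reported action accuracies can be converted back to row counts and given rough uncertainty bounds. The sketch assumes action accuracy is a plain fraction of rows with the correct action, which the abstract does not state explicitly; under that assumption 0.9204 on n=314 corresponds to 289 correct rows and 0.8997 on n=319 to 287. The Wilson interval is a standard choice here, not something the paper reports.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Split sizes and accuracies are taken from the abstract.
for split, n, acc in (("dev", 314, 0.9204), ("test", 319, 0.8997)):
    k = round(acc * n)  # implied number of correct rows, under the stated assumption
    low, high = wilson_interval(k, n)
    print(f"{split}: {k}/{n} correct, interval [{low:.3f}, {high:.3f}]")
```

With roughly 300 rows per split, the interval spans several percentage points, which is worth keeping in mind when comparing dev and test accuracy.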

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces claim decomposition plus an intent-aware selector to certify medical RAG outputs under mixed evidence, with clean reported metrics on a filtered subset.

The main takeaway is a concrete way to move beyond binary answer-or-abstain in medical RAG. The work decomposes a response into claims, scores each against retrieved evidence, and routes the result through an intent-aware selector into one of four actions: full, partial, conflict, or abstain. On the real-source-only dev and test rows the system hits UCCR of zero, PAU near 1, and action accuracy around 0.90 on roughly 300 examples each. That is a tidy result under their protocol.

Referee Report

2 major / 2 minor

Summary. The paper proposes claim-selective certification for high-risk medical RAG systems. Generated responses are decomposed into verifiable claims, each scored against retrieved evidence, and mapped by an intent-aware selector to one of four actions (full, partial, conflict, abstain). On the primary weak-label certificate protocol, the real-source-only dev/test rows yield UCCR=0.0000, PAU=1.0000/0.9967, PAU Precision=0.9901/0.9739, and action accuracy=0.9204/0.8997 (n=314/319). Additional controls include shortcut analyses, source/evidence-novel slices, and a source-missing counterfactual.

Significance. If the decomposition and scoring pipeline prove robust, the framework offers a more granular safety interface than binary abstain decisions for medical QA under mixed evidence. Strengths include the use of shortcut controls to quantify metadata priors and counterfactual slices to probe boundaries. The zero UCCR and near-perfect PAU on the reported protocol are notable, but overall significance hinges on whether the real-source-only filter faithfully samples realistic mixed-evidence distributions.

major comments (2)

[Abstract and §3 (protocol)] Abstract and primary weak-label certificate protocol description: the real-source-only dev/test rows are stated to cover 'naturally occurring non-abstain actions,' yet the separate source-missing counterfactual slice implies systematic exclusion of cases with absent or contradictory sources. This filter choice is load-bearing for the headline claim that UCCR=0.0000 and PAU≈1.0 demonstrate reliable claim-selective certification under mixed evidence; without explicit quantification of how many mixed-evidence conflicts are dropped, the metrics risk reflecting an easier subset.
[Results section] Evaluation results (dev/test rows): action accuracy is reported at 0.9204/0.8997, but no per-action or per-claim-type breakdown (e.g., conflict vs. partial handling) or error analysis is provided. This omission weakens the supporting claim that the intent-aware selector reliably separates action prediction from evidence-linked claim selection.

minor comments (2)

[Notation and metrics] The acronyms UCCR, PAU, and PAU Precision are used throughout but lack a single, self-contained formal definition; adding an explicit notation table or appendix would improve clarity.
[Tables/figures] A consolidated table comparing all slices (real-source-only, source-missing, evidence-novel) would make the transfer-boundary analysis easier to follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to improve the manuscript.

Referee: [Abstract and §3 (protocol)] Abstract and primary weak-label certificate protocol description: the real-source-only dev/test rows are stated to cover 'naturally occurring non-abstain actions,' yet the separate source-missing counterfactual slice implies systematic exclusion of cases with absent or contradictory sources. This filter choice is load-bearing for the headline claim that UCCR=0.0000 and PAU≈1.0 demonstrate reliable claim-selective certification under mixed evidence; without explicit quantification of how many mixed-evidence conflicts are dropped, the metrics risk reflecting an easier subset.

Authors: The real-source-only filter is applied to isolate evaluation on instances with available evidence, allowing assessment of claim decomposition and selective certification under mixed but present sources. The source-missing counterfactual is reported separately to probe the abstain case. We agree that quantifying the excluded cases would clarify the scope. In revision we will add the count and proportion of instances removed by the real-source-only criterion on dev and test, together with a short characterization of whether excluded items disproportionately involve conflicts or contradictions. revision: yes
Referee: [Results section] Evaluation results (dev/test rows): action accuracy is reported at 0.9204/0.8997, but no per-action or per-claim-type breakdown (e.g., conflict vs. partial handling) or error analysis is provided. This omission weakens the supporting claim that the intent-aware selector reliably separates action prediction from evidence-linked claim selection.

Authors: We concur that per-action metrics and error analysis would better substantiate the selector's behavior. In the revised results section we will include a table reporting accuracy, precision, and recall for each action (full, partial, conflict, abstain) on both dev and test, plus a concise error analysis of misclassifications, with particular attention to conflict and partial cases and how they relate to the upstream claim-scoring step. revision: yes
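The per-action table promised in the response could be computed along the following lines. This is a plain-Python sketch, assuming parallel lists of gold and predicted action labels; it says nothing about what the revised paper will actually report.

```python
from collections import Counter

ACTIONS = ("full", "partial", "conflict", "abstain")


def per_action_report(gold, pred):
    """Precision, recall, and support for each action label.

    Precision or recall is None when its denominator is zero.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but gold was something else
            fn[g] += 1  # gold g was missed
    report = {}
    for action in ACTIONS:
        predicted = tp[action] + fp[action]
        actual = tp[action] + fn[action]
        report[action] = {
            "precision": tp[action] / predicted if predicted else None,
            "recall": tp[action] / actual if actual else None,
            "support": actual,
        }
    return report
```

Reporting support alongside each rate matters here because rare actions such as conflict can have very few rows, and a single error then moves the rate substantially.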

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper defines a claim-selective certification protocol for medical RAG and reports empirical metrics (UCCR, PAU, action accuracy) on real-source-only dev/test rows. These metrics are described as measuring unsupported-claim risk within the certificate definition, but the values are computed on specific held-out data splits rather than being forced by construction from the system inputs or protocol definition. No equations, self-citations, or fitted parameters are shown that reduce the central claims (decomposition into claims, intent-aware selector, separation of action prediction from evidence-linked selection) to tautologies or renamed inputs. The evaluation includes shortcut controls and counterfactual slices, making the results falsifiable on the provided data distributions. The protocol's focus on non-abstain actions is an explicit design choice for the primary weak-label setting, not a self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities are identifiable from the text.

[1] [1]

Retrieval-augmented generation for knowledge- intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge- intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020. 9

work page 2020

[2] [2]

Retrieval-Augmented Generation for Large Language Models: A Survey

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Leveraging passage retrieval with generative models for open domain question answering

Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880. Association for Computational Linguistics, 2021

work page 2021

[4] [4]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas O˘guz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen- tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 6769–6781. Association for Computational Linguistics, 2020

work page 2020

[5] [5]

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection.arXiv preprint arXiv:2310.11511, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Query rewriting in retrieval-augmented large language models

Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Query rewriting in retrieval-augmented large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13679–13690. Association for Computational Linguistics, 2023

work page 2023

[7] [7]

Smith, and Mike Lewis

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 5664–5687. Association for Computational Linguistics, 2023

work page 2023

[8] [8]

Selective classification for deep neural networks

Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. InAdvances in neural information processing systems, pages 4878–4887, 2017

work page 2017

[9] [9]

Selective question answering under domain shift

Amita Kamath, Robin Jia, and Percy Liang. Selective question answering under domain shift. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5684–5696. Association for Computational Linguistics, 2020

work page 2020

[10] [10]

A dataset for answering time-sensitive questions.arXiv preprint arXiv:2108.06314, 2021

Wenhu Chen, Xinyi Wang, and William Yang Wang. A dataset for answering time-sensitive questions.arXiv preprint arXiv:2108.06314, 2021

work page arXiv 2021

[11] [11]

Knowing what you know: Calibrating dialogue belief state distributions via ensembles

Carel van Niekerk, Michael Heck, Christian Geishauser, Hsien-chin Lin, Nurul Lubis, Marco Moresi, and Milica Gasic. Knowing what you know: Calibrating dialogue belief state distributions via ensembles. InFindings of the Association for Computational Linguistics: EMNLP 2020, pages 3096–3102. Association for Computational Linguistics, 2020

work page 2020

[12] [12]

Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. InInternational Conference on Learning Representations, 2023

work page 2023

[13] [13]

On faithfulness and factuality in abstractive summarization

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919. Association for Computational Linguistics, 2020

work page 1906

[14] [14]

Faith- dial: A faithful benchmark for information-seeking dialogue.Transactions of the Association for Computational Linguistics, 10:1473–1490, 2022

Nouha Dziri, Ehsan Kamalloo, Sivan Milton, Osmar Zaiane, Mo Yu, Edoardo Maria Ponti, and Siva Reddy. Faith- dial: A faithful benchmark for information-seeking dialogue.Transactions of the Association for Computational Linguistics, 10:1473–1490, 2022

work page 2022

[15] [15]

Measuring attribution in natural language generation models

Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Lora Aroyo, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. Measuring attribution in natural language generation models. Computational Linguistics, pages 1–64, 2023

work page 2023

[16] [16]

Fever: a large-scale dataset for fact extraction and verification

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. Fever: a large-scale dataset for fact extraction and verification. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 809–819, 2018

work page 2018

[17] Tal Schuster, Adam Fisch, and Regina Barzilay. Get your Vitamin C! Robust fact verification with contrastive evidence. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 624–640. Association for Computational Linguistics, 2021.

[18] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6465–6488. Association for Computational Linguistics, 2023.

[19] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023.

[20] Cosmo Du, Nathan Hu, Da Huang, Jie Huang, Quoc Le, Ruibo Liu, Yifeng Lu, Daiyi Peng, Xinying Song, Dustin Tran, Jerry Wei, and Chengrun Yang. Long-form factuality in large language models. In Advances in Neural Information Processing Systems 37, pages 80756–80827, 2024.

[21] Liyan Tang, Philippe Laban, and Greg Durrett. MiniCheck: Efficient fact-checking of LLMs on grounding documents. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8818–8847, 2024.

[22] Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 107–112. Association for Computational Linguist...

[23] R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448. Association for Computational Linguistics, 2019.

[24] Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912. Association for Computational Linguistics, 2020.

[25] Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 9275–9293. Association for Computational Linguistics, 2020.

[26] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard L. Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Brian Earnshaw, Imran Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS: A benchm...

[27] Asma Ben Abacha, Chaitanya Shivade, and Dina Demner-Fushman. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 370–379, 2019.

[28] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019.

[29] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.

[30] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. arXiv preprint arXiv:2203.14371, 2022.

[31] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.

[32] Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023.

A Appendix Overview

This appendix reports the claim summary, additional diagnostics, uncertainty estimates, reproducibility commands, and asset information used to support the main pa...