CQA-Eval: Designing Reliable Evaluations of Multi-paragraph Clinical QA under Resource Constraints
Evaluating multi-paragraph clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise, and achieving consistent human judgments over multi-paragraph text is difficult. We introduce CQA-Eval, an evaluation framework and set of recommendations for limited-resource, high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level versus fine-grained sentence-level evaluation along the dimensions of correctness, relevance, and risk disclosure. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse annotation improves agreement on relevance, and judgments on risk disclosure remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotation, reducing cost and effort.
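The abstract does not name the agreement statistic used, so the sketch below uses Cohen's kappa, a standard two-annotator IAA measure, to illustrate the dimension-by-granularity comparison the paper describes. The data layout, label sets, and helper names here are hypothetical assumptions for illustration, not the paper's released code.

```python
# A minimal sketch of dimension-wise agreement analysis, assuming two
# annotators and nominal labels. All data below is made up for illustration.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over paired nominal labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent label marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lbl] * freq_b[lbl] for lbl in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations keyed by (dimension, granularity), where
# "answer" = coarse answer-level and "sentence" = fine-grained sentence-level.
annotations = {
    ("correctness", "answer"):   (["yes", "no", "yes"], ["yes", "yes", "yes"]),
    ("correctness", "sentence"): (["yes", "yes", "no", "no"], ["yes", "yes", "no", "yes"]),
    ("relevance", "answer"):     (["high", "high", "low"], ["high", "high", "low"]),
}

for (dimension, granularity), (a, b) in annotations.items():
    print(f"{dimension:12s} {granularity:8s} kappa={cohens_kappa(a, b):+.2f}")
```

The same loop extends naturally to the paper's subsampling finding: recompute kappa over a random subset of the sentence-level items and compare it against the coarse answer-level score to see how much annotation can be saved before reliability degrades.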
Forward citations
Cited by 1 Pith paper
From Table to Cell: Attention for Better Reasoning with TABALIGN
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy by 15.76 points over strong baselines on eight table benchmarks while speeding exec...