pith. machine review for the scientific record.

arxiv: 2510.10415 · v3 · submitted 2025-10-12 · 💻 cs.CL · cs.AI

Recognition: unknown

CQA-Eval: Designing Reliable Evaluations of Multi-paragraph Clinical QA under Resource Constraints

Authors on Pith: no claims yet
classification: 💻 cs.CL · cs.AI
keywords: agreement · coarse · evaluation · judgments · multi-paragraph · annotations · clinical · correctness
abstract

Evaluating multi-paragraph clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise, and achieving consistent human judgments over multi-paragraph text is difficult. We introduce CQA-Eval, an evaluation framework and set of evaluation recommendations for limited-resource, high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level and fine-grained sentence-level evaluation along the dimensions of correctness, relevance, and risk disclosure. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse annotation improves agreement on relevance, and judgments on risk disclosure remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotation, reducing cost and effort.
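The abstract does not say which agreement statistic CQA-Eval uses, so the sketch below uses Cohen's kappa, a common IAA measure, to illustrate the three comparisons it describes: fine-grained sentence-level labels, coarse answer-level labels, and labels over a small per-answer sentence subset. All data, the two-annotator noise model, and the all-sentences-correct rule for the coarse label are invented for illustration; this is not the paper's protocol.

```python
# Minimal sketch (not the authors' code): compare inter-annotator agreement
# for fine-grained, coarse, and subset annotation using Cohen's kappa.
import random
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n          # raw agreement
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

random.seed(0)

# Hypothetical data: 300 answers, each split into 3-8 sentences; annotator 1
# labels each sentence correct (True) or incorrect (False).
answers = [[random.random() < 0.8 for _ in range(random.randint(3, 8))]
           for _ in range(300)]

def annotate(answer, noise):
    # Hypothetical annotator 2: flips each label with probability `noise`.
    return [lab if random.random() > noise else not lab for lab in answer]

ann2_answers = [annotate(a, 0.1) for a in answers]

# Fine-grained IAA: agreement over every sentence label.
fine1 = [s for a in answers for s in a]
fine2 = [s for a in ann2_answers for s in a]

# Coarse IAA: one label per answer (here: correct iff all sentences are).
coarse1 = [all(a) for a in answers]
coarse2 = [all(a) for a in ann2_answers]

# Subset IAA: only k randomly chosen sentences per answer are annotated.
k = 2
idx = [random.sample(range(len(a)), min(k, len(a))) for a in answers]
sub1 = [answers[j][i] for j, ids in enumerate(idx) for i in ids]
sub2 = [ann2_answers[j][i] for j, ids in enumerate(idx) for i in ids]

print("fine-grained kappa :", round(cohens_kappa(fine1, fine2), 3))
print("coarse kappa       :", round(cohens_kappa(coarse1, coarse2), 3))
print("subset kappa (k=2) :", round(cohens_kappa(sub1, sub2), 3))
```

Under this toy setup, comparing the subset kappa to the coarse kappa mirrors the paper's cost argument: if agreement on a small sampled subset matches answer-level agreement, far fewer expert judgments are needed per answer.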

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Table to Cell: Attention for Better Reasoning with TABALIGN

cs.AI · 2026-05 · unverdicted · novelty 7.0

    TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...