
arxiv: 2606.29467 · v1 · pith:VXCN43UCnew · submitted 2026-06-28 · 💻 cs.CL · cs.IR

mamabench and mamaretrieval: Benchmarks for Evaluating Medical Retrieval-Augmented Generation in Maternal, Neonatal, and Reproductive Health

Pith reviewed 2026-06-30 07:36 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords maternal health · retrieval-augmented generation · medical benchmarks · question answering · relevance grading · clinical guidelines · nurse-midwife queries

The pith

Two benchmarks assembled from expert sources enable evaluation of retrieval-augmented generation for maternal, neonatal, and reproductive health.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper releases mamabench, a set of 25,949 scope-filtered questions drawn from seven existing expert sources, and mamaretrieval, a collection of 3,185 queries paired with graded (0-6) relevance labels over 63,650 guideline chunks. These resources address two gaps: no public benchmark covers the specific questions nurse-midwives ask, and no chunk-level graded relevance data exists for maternal-health retrieval. Three construction choices stand out: filtering expert material rather than writing new items, grading relevance on a 0-6 scale instead of assigning binary labels, and reporting label-quality measures such as scope-classifier agreement, a frontier-judge check, and a pooling-completeness audit. A sympathetic reader would see this as supplying the missing test beds needed to measure how well language models and retrieval systems perform in this clinical domain.

Core claim

By assembling and scope-filtering existing expert-authored sources into mamabench for question answering and creating mamaretrieval with decomposed graded relevance labels over a maternal-health guideline corpus, the benchmarks provide the first public resources for evaluating retrieval-augmented generation systems on the maternal, neonatal, and reproductive health questions that nurse-midwives actually encounter, while explicitly disclosing the limits of those labels.

What carries the argument

mamabench, a 25,949-item QA set filtered from seven expert sources across multiple-choice, short-answer, and rubric tracks, together with mamaretrieval, 3,185 queries paired with 0-6 graded relevance labels over 63,650 chunks using a rubric that separates answer-providing chunks from merely topical ones.
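To make the difference between graded and binarised relevance concrete, the following is a minimal sketch of nDCG@k computed from 0-6 grades. Everything in it is an illustrative assumption rather than the paper's evaluation protocol: the function names, the linear gain, the binarisation threshold of 3, and the toy grades for a single query.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with the usual log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg_at_k(ranked_grades, all_grades, k):
    """nDCG@k for one query.

    ranked_grades: grades of the chunks in the order the system returned them.
    all_grades: grades of every labelled chunk for the query (defines the ideal ranking).
    """
    ideal_dcg = dcg(sorted(all_grades, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0  # no chunk with a positive grade exists for this query
    return dcg(ranked_grades[:k]) / ideal_dcg

def binarise(grades, threshold):
    """Collapse graded labels to 0/1 at a (hypothetical) relevance threshold."""
    return [1 if g >= threshold else 0 for g in grades]

# Hypothetical labels for one query: one chunk answers it (6),
# one is merely on topic (3), one is off topic (0).
labelled = [6, 3, 0]
run_a = [6, 3, 0]  # answering chunk ranked first
run_b = [3, 6, 0]  # topical chunk ranked first
THRESHOLD = 3      # assumed cut-off, not taken from the paper

graded_a = ndcg_at_k(run_a, labelled, k=3)
graded_b = ndcg_at_k(run_b, labelled, k=3)
binary_a = ndcg_at_k(binarise(run_a, THRESHOLD), binarise(labelled, THRESHOLD), k=3)
binary_b = ndcg_at_k(binarise(run_b, THRESHOLD), binarise(labelled, THRESHOLD), k=3)

print(f"graded: run_a={graded_a:.3f} run_b={graded_b:.3f}")
print(f"binary: run_a={binary_a:.3f} run_b={binary_b:.3f}")
```

Under these toy labels the graded metric separates the two runs, while the binarised metric scores them identically. That is the kind of distinction a rubric separating answering chunks from topical ones is meant to preserve.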

If this is right

  • LLM judges for rubric-graded answers can be calibrated using the re-scoped physician-labelled meta-evaluation from HealthBench.
  • Deployed on-device maternal-health assistants can be evaluated end-to-end against both QA accuracy and retrieval quality.
  • Retrieval systems can be trained and measured on graded (0-6) relevance labels rather than binary decisions.
  • Future benchmark creators can adopt the same practice of reporting scope-classifier agreement, frontier-judge checks, and pooling-completeness audits.
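Both judge calibration and scope-classifier agreement come down to comparing two sets of labels on the same items. As a minimal sketch, the snippet below computes Cohen's kappa between physician labels and LLM-judge labels. The choice of kappa, the function name, and the toy labels are assumptions made for illustration; the paper and HealthBench may report a different agreement statistic.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical data: 1 = rubric criterion met, 0 = not met.
physician = [1, 1, 0, 0, 1, 0, 1, 0]
llm_judge = [1, 1, 0, 1, 1, 0, 0, 0]

kappa = cohens_kappa(physician, llm_judge)
print(f"kappa = {kappa:.2f}")
```

A kappa near 1 would suggest the judge tracks physician labels closely; a value near 0 would suggest agreement no better than chance. Any threshold for an acceptable judge is left to the benchmark user.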

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The graded-relevance approach could be extended to other medical guideline corpora where topic overlap is common but direct answers are rare.
  • Similar assembly from expert sources might reduce the cost of creating domain-specific benchmarks in other clinical areas that already have published guidelines.
  • The explicit disclosure of label limits provides a template that could raise standards for how medical AI evaluation resources are released.

Load-bearing premise

The seven existing expert-authored sources, after scope filtering, adequately represent the range of maternal, neonatal, and reproductive-health questions that nurse-midwives actually ask in practice.

What would settle it

A survey or log of real nurse-midwife clinical queries in which more than a small fraction fall outside the topics covered by the seven filtered sources.
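As a rough sketch of how such a check could be quantified, the snippet below estimates the out-of-coverage fraction from a sample of queries and attaches a Wilson score interval. The sample size, the count of uncovered queries, and the function name are hypothetical; the paper reports no such survey.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width

# Hypothetical audit: 12 of 200 sampled clinical queries fall outside
# the topics covered by the seven filtered sources.
uncovered, sampled = 12, 200
low, high = wilson_interval(uncovered, sampled)
print(f"out-of-coverage fraction: {uncovered / sampled:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Whether the resulting interval counts as "a small fraction" depends on a threshold the reader would have to fix in advance.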

Original abstract

Medical question-answering benchmarks rarely cover the maternal, neonatal, child, and reproductive-health questions a nurse-midwife asks, and, to our knowledge, no public chunk-level relevance benchmark exists for maternal-health guideline retrieval. We release two benchmarks that fill these gaps. mamabench is a scope-filtered QA set of 25,949 items assembled from seven existing expert-authored sources across multiple-choice, short-answer, and rubric-graded tracks; to help users calibrate the LLM judge that scores the rubric track, we re-scope HealthBench's physician-labelled meta-evaluation to the domain. mamaretrieval pairs 3,185 clinical queries with graded (0-6) relevance labels over a 63,650-chunk maternal-health guideline corpus, using a decomposed rubric that distinguishes a chunk that answers a query from one merely on its topic. Three decisions shape both: assemble and filter expert sources rather than author questions, grade relevance rather than binarise it, and measure and disclose the limits of the labels -- scope-classifier agreement, a frontier-judge check, and a pooling-completeness audit -- rather than treat them as an oracle. A companion paper uses the benchmarks to evaluate a deployed on-device assistant; both are released openly for research.
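One common way to audit pooling completeness is to ask how many of a new system's top-k results actually carry a relevance label. The sketch below computes that "judged@k" rate. The data layout, identifiers, and function names are assumptions for illustration, and the paper's own audit may be defined differently.

```python
def judged_at_k(ranked_chunk_ids, judged_chunk_ids, k):
    """Fraction of the top-k retrieved chunks that have a relevance label."""
    top_k = ranked_chunk_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in judged_chunk_ids) / len(top_k)

def mean_judged_at_k(run, qrels, k):
    """Average judged@k over all queries in a run.

    run:   {query_id: [chunk_id, ...]} in ranked order.
    qrels: {query_id: {chunk_id: grade}} holding the pooled labels.
    """
    scores = [
        judged_at_k(ranked, set(qrels.get(query_id, {})), k)
        for query_id, ranked in run.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical pooled labels and a run from a system that was not in the pool.
qrels = {
    "q1": {"c1": 6, "c2": 3, "c3": 0},
    "q2": {"c4": 5, "c5": 1},
}
new_run = {
    "q1": ["c1", "c9", "c2"],  # c9 was never judged
    "q2": ["c4", "c5", "c8"],  # c8 was never judged
}

coverage = mean_judged_at_k(new_run, qrels, k=3)
print(f"mean judged@3 = {coverage:.3f}")
```

A low judged@k for a new retriever would indicate that unlabelled chunks are being treated as non-relevant by default, which can understate that system's measured quality.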

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper releases two benchmarks for medical retrieval-augmented generation focused on maternal, neonatal, and reproductive health. mamabench is a scope-filtered QA dataset of 25,949 items drawn from seven existing expert-authored sources across multiple-choice, short-answer, and rubric-graded tracks (with a re-scoped HealthBench meta-evaluation for the rubric track). mamaretrieval pairs 3,185 clinical queries with graded (0-6) relevance labels over a 63,650-chunk guideline corpus. The work emphasizes assembling and filtering expert sources rather than authoring new questions, using graded rather than binary relevance, and disclosing label limits such as scope-classifier agreement, frontier-judge checks, and pooling-completeness audits.

Significance. If the filtered sources prove representative of real clinical questions in the target setting, the benchmarks would address a documented gap in domain-specific resources for maternal-health QA and chunk-level retrieval evaluation. The transparent disclosure of label-quality metrics and the reuse of expert sources are methodological strengths that increase the resources' utility for downstream LLM and RAG assessment; the companion paper on a deployed on-device assistant further demonstrates practical relevance.

major comments (1)
  1. [Abstract and source-selection description] The claim that the benchmarks fill the stated gaps in coverage for the questions a nurse-midwife asks depends on the representativeness of the seven scope-filtered expert sources. The manuscript reports no comparison of the resulting 25,949 QA items or 3,185 queries against actual clinical query logs, practitioner surveys, or usage data from maternal-health settings (Abstract; source-selection description). This assumption is load-bearing for the central contribution and remains untested.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the review and for identifying the load-bearing assumption regarding source representativeness. We respond to the single major comment below.

point-by-point responses
  1. Referee: [Abstract and source-selection description] The claim that the benchmarks fill the stated gaps in coverage for the questions a nurse-midwife asks depends on the representativeness of the seven scope-filtered expert sources. The manuscript reports no comparison of the resulting 25,949 QA items or 3,185 queries against actual clinical query logs, practitioner surveys, or usage data from maternal-health settings (Abstract; source-selection description). This assumption is load-bearing for the central contribution and remains untested.

    Authors: We agree that a direct comparison against clinical query logs, practitioner surveys, or usage data would provide stronger evidence. No such public logs or surveys exist for this narrow domain, and obtaining them would require resources and approvals outside the scope of a benchmark-release paper. Source selection was instead driven by the established authority of the seven expert-authored collections (WHO guidelines, standard midwifery texts, and similar resources) that are routinely used in training and reference for nurse-midwives and equivalent practitioners. The scope classifier was trained and audited specifically to retain only items within the target maternal, neonatal, and reproductive-health scope. We have revised the source-selection description and limitations sections to state more explicitly that the benchmarks constitute a high-quality proxy derived from authoritative expert sources rather than a validated sample of real-world query distributions.

    Revision: partial

standing simulated objections not resolved
  • Direct empirical validation of representativeness against private clinical query logs or new practitioner surveys cannot be performed with available resources.

Circularity Check

0 steps flagged

No circularity: benchmark release relies on external sources without self-referential derivations

full rationale

The paper is a data-resource release that assembles and scope-filters existing expert-authored QA sources and guideline corpora into mamabench and mamaretrieval. No equations, fitted parameters, predictions, or derivations appear in the manuscript. The central claims rest on the external provenance of the seven sources and disclosed label-quality audits rather than any internal reduction to the paper's own inputs. Self-citation is limited to a companion paper that applies the benchmarks; it is not load-bearing for the construction or validity claims here. This matches the default non-circular case for resource papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, axioms, or invented entities; the contribution is curation and release of evaluation data.

pith-pipeline@v0.9.1-grok · 5755 in / 1116 out tokens · 31821 ms · 2026-06-30T07:36:37.493131+00:00 · methodology

