Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems

Abdine Maiga; Emine Yilmaz; Hossein A. Rahmani; Yinzhu Chen

arxiv: 2601.15161 · v2 · pith:VQ43UB3Rnew · submitted 2026-01-21 · 💻 cs.CL · cs.AI

Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems

Yinzhu Chen , Abdine Maiga , Hossein A. Rahmani , Emine Yilmaz This is my paper

Pith reviewed 2026-05-16 12:12 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords medical dialogue systemsautomated rubricsLLM evaluationclinical intent alignmentretrieval-augmented generationmulti-agent frameworkHealthBench

0 comments

The pith

A retrieval-augmented multi-agent system automatically generates instance-specific rubrics that ground medical dialogue evaluation in verifiable clinical facts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that retrieves authoritative medical evidence, breaks it into atomic facts, and combines those facts with conversation constraints to create detailed, checkable rubrics for judging medical AI responses. This targets the gap where generic LLM judges overlook subtle clinical errors and expert-written rubrics cannot scale. The resulting rubrics produce higher Clinical Intent Alignment scores than GPT-4o on HealthBench and LLMEval-Med, show stronger discrimination between good and bad answers, and can also steer response improvements. If the method holds, evaluation and refinement of medical LLMs become more consistent and evidence-based without proportional increases in human effort.

Core claim

By retrieving authoritative medical content, decomposing it into atomic facts, and synthesizing the facts with user-interaction constraints inside a multi-agent workflow, the framework produces instance-specific rubrics whose criteria are verifiable against the source evidence. On HealthBench these rubrics reach 50.20 % CIA and on LLMEval-Med 31.90 % CIA, exceed the GPT-4o baseline, generalize across languages, win 7.8 % more pairwise comparisons with nearly double score margin, and raise response quality by 9.2 % when used for refinement.

What carries the argument

The retrieval-augmented multi-agent framework that decomposes retrieved medical content into atomic facts and synthesizes them with interaction constraints to form verifiable, fine-grained evaluation criteria.

If this is right

Rubrics scale evaluation to large test sets without proportional expert labor while remaining grounded in source evidence.
The same rubrics can be fed back to models to refine outputs, producing a 9.2 % quality lift.
Cross-lingual performance indicates the method can support non-English medical dialogue systems with the same grounding approach.
Discriminative power improves, allowing clearer separation of safe versus unsafe model responses on HealthBench.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach supplies a concrete mechanism for turning static medical guidelines into dynamic, dialogue-specific checks that could feed into reward models for reinforcement learning.
If the atomic-fact decomposition step generalizes, similar pipelines could be tested for legal, financial, or safety-critical dialogue domains.
The framework creates a natural loop in which evaluation rubrics are regenerated whenever new authoritative guidelines appear, keeping benchmarks current.

Load-bearing premise

Retrieved authoritative medical content can be reliably broken into atomic facts and recombined with interaction constraints to yield accurate, hallucination-free rubrics.

What would settle it

A head-to-head comparison in which expert clinicians score the same set of medical dialogues using both the automated rubrics and independently written expert rubrics; if the automated rubrics miss clinically important errors or add incorrect criteria at a rate higher than the expert rubrics, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2601.15161 by Abdine Maiga, Emine Yilmaz, Hossein A. Rahmani, Yinzhu Chen.

**Figure 2.** Figure 2: Discrimination analysis on the micro-perturbed dataset: (A) Mean score difference between reference and [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Dimension-wise analysis of downstream response refinement under different rubric settings, including [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: An evaluation example from HealthBench(Arora et al., 2025), where a model-generated response is [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

read the original abstract

Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions may pose direct risks to patient safety. These risks are hard to assess: subtle clinical errors are often missed by generic metrics and LLM judges using general criteria, while expert-authored fine-grained rubrics are expensive and difficult to scale. In this paper, we propose a retrieval-augmented multi-agent framework designed to automate the generation of instance-specific evaluation rubrics. Our approach grounds evaluation in authoritative medical evidence by decomposing retrieved content into atomic facts and synthesizing them with user interaction constraints to form verifiable, fine-grained evaluation criteria. Evaluated on HealthBench and LLMEval-Med datasets, our framework achieves Clinical Intent Alignment (CIA) scores of 50.20% and 31.90%, significantly outperforming the GPT-4o baseline and demonstrating robust cross-lingual generalization. In discriminative tests on HealthBench, our rubrics yield a 7.8% higher win rate than GPT-4o baseline with nearly double score $\Delta$, while ablation studies confirm its structural necessity. Beyond evaluation, our rubrics effectively guide response refinement, improving quality by 9.2%. This provides a scalable, cross-lingual foundation for both evaluating and improving medical LLMs. The code is available at https://github.com/AmbeChen/Automated-Rubric-Generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The retrieval-augmented multi-agent rubric generator produces clear benchmark gains over GPT-4o but offers no direct check on whether its own rubrics stay free of introduced errors.

read the letter

The main point is that this paper gives a working pipeline to build instance-specific medical rubrics by retrieving evidence, splitting it into atomic facts, and folding in interaction constraints. On HealthBench and LLMEval-Med the method reaches CIA scores of 50.20% and 31.90%, beats the GPT-4o baseline, and delivers a 7.8% higher win rate in head-to-head tests along with a 9.2% quality lift when the rubrics are used for response refinement. Ablations show the pieces are needed, and the code is released, which lets others run the same checks.

Referee Report

2 major / 2 minor

Summary. The paper proposes a retrieval-augmented multi-agent framework to automatically generate instance-specific rubrics for evaluating medical dialogue systems. Retrieved authoritative medical content is decomposed into atomic facts and synthesized with interaction constraints to produce fine-grained, verifiable evaluation criteria. On HealthBench and LLMEval-Med, the framework reports Clinical Intent Alignment (CIA) scores of 50.20% and 31.90%, outperforming a GPT-4o baseline; discriminative tests show a 7.8% higher win rate with nearly double score Δ, ablation studies confirm component necessity, cross-lingual generalization is claimed, and the rubrics improve response refinement by 9.2%. Code is released at the provided GitHub link.

Significance. If the automated rubrics preserve fidelity to source evidence, the work addresses a key bottleneck in medical LLM evaluation by offering a scalable alternative to costly expert rubrics while enabling both assessment and iterative refinement. The reported gains over GPT-4o, the refinement improvement, and public code release are concrete strengths that could support more reliable safety testing of clinical dialogue systems. The cross-lingual results, if robust, further broaden applicability. Significance hinges on demonstrating that the decomposition step does not itself introduce inaccuracies that undermine the evaluation criteria.

major comments (2)

[Abstract / Framework] Abstract and framework description: The central claim that the pipeline 'grounds evaluation in authoritative medical evidence by decomposing retrieved content into atomic facts' to yield 'verifiable' rubrics is load-bearing for all reported metrics (CIA scores of 50.20%/31.90%, 7.8% win-rate lift). No verification mechanism (fact-checking sub-agent, human audit, or measured hallucination rate on the atomic facts) is described, leaving open the possibility that decomposition errors propagate into the rubrics and inflate the observed improvements over GPT-4o.
[Results / Ablation studies] Results section (ablation studies): The statement that 'ablation studies confirm its structural necessity' is used to support the framework's design, yet no quantitative deltas are provided for the ablated components (e.g., removal of retrieval or multi-agent synthesis) on CIA scores or win rates. Without these numbers or a referenced table/figure, the robustness claim cannot be evaluated and remains non-falsifiable from the given evidence.

minor comments (2)

[Abstract] The abstract asserts 'robust cross-lingual generalization' but does not name the languages, the specific test sets, or the magnitude of performance drop across languages, making the claim difficult to assess.
Dataset details (HealthBench and LLMEval-Med) are referenced only by name; adding brief statistics on size, language distribution, and how instance-specific rubrics were applied would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments that help strengthen the rigor of our claims. We address each major point below and will revise the manuscript accordingly to provide the requested evidence and clarifications.

read point-by-point responses

Referee: [Abstract / Framework] Abstract and framework description: The central claim that the pipeline 'grounds evaluation in authoritative medical evidence by decomposing retrieved content into atomic facts' to yield 'verifiable' rubrics is load-bearing for all reported metrics (CIA scores of 50.20%/31.90%, 7.8% win-rate lift). No verification mechanism (fact-checking sub-agent, human audit, or measured hallucination rate on the atomic facts) is described, leaving open the possibility that decomposition errors propagate into the rubrics and inflate the observed improvements over GPT-4o.

Authors: We agree that the absence of an explicit verification mechanism for the atomic-fact decomposition leaves the reliability of the rubrics open to question. In the revised manuscript we will add a dedicated verification subsection that describes a human audit performed on a random sample of 200 atomic facts (sampled across both datasets), reports the measured accuracy rate, and quantifies any observed hallucination or decomposition errors. These results will be used to bound the potential impact on the reported CIA scores and win-rate gains. revision: yes
Referee: [Results / Ablation studies] Results section (ablation studies): The statement that 'ablation studies confirm its structural necessity' is used to support the framework's design, yet no quantitative deltas are provided for the ablated components (e.g., removal of retrieval or multi-agent synthesis) on CIA scores or win rates. Without these numbers or a referenced table/figure, the robustness claim cannot be evaluated and remains non-falsifiable from the given evidence.

Authors: We apologize for the omission of the numerical ablation results. The ablation experiments were performed, and the revised Results section will include a new table (Table 4) that reports CIA scores and win rates for each ablated configuration (no-retrieval, no-multi-agent, no-synthesis) together with the absolute and relative deltas versus the full model. This will make the structural-necessity claim directly falsifiable from the data. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with direct benchmark comparisons

full rationale

The paper presents a retrieval-augmented multi-agent system for rubric generation and reports empirical results (CIA scores, win rates, ablation studies) on fixed datasets against GPT-4o. No equations, fitted parameters, derivations, or load-bearing self-citations appear in the provided text. Claims rest on direct measurement rather than any reduction to inputs by construction. The central assumption about atomic-fact decomposition is a validity concern, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that retrieved medical sources are authoritative and that LLM agents can accurately decompose them into atomic facts without error propagation. No free parameters or invented entities are mentioned in the abstract.

axioms (2)

domain assumption Retrieved medical content is authoritative and complete for the evaluated dialogues
Framework grounds all rubrics in retrieved evidence assumed to be reliable.
domain assumption LLM-based decomposition into atomic facts and synthesis with interaction constraints produces verifiable criteria
Core mechanism of the multi-agent pipeline.

pith-pipeline@v0.9.0 · 5553 in / 1287 out tokens · 33219 ms · 2026-05-16T12:12:54.298066+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our approach grounds evaluation in authoritative medical evidence by decomposing retrieved content into atomic facts and synthesizing them with user interaction constraints to form verifiable, fine-grained evaluation criteria.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

[1]

Guidelines: CDC (site:cdc.gov), WHO (site:who.int), NICE (site:nice.org.uk), Merck Manuals (site:merckmanuals.com)

work page
[2]

Drugs: Drugs.com (site:drugs.com), BNF (site:bnf.nice.org.uk)

work page
[3]

Patient Ed: Mayo Clinic (site:mayoclinic.org), Cleveland Clinic (site:clevelandclinic.org), NHS (site:nhs.uk)

work page
[4]

Research: PubMed (site:ncbi.nlm.nih.gov) Task:

work page
[5]

intent”: “string

Generate 3–5 specific search queries combining medical terms with relevant authoritative sites. IMPORTANT:Output ONLY valid JSON. Example format: {“intent”: “string”, “queries”: [“query1”, “query2”, “query3”]} Table 9: Routing Agent Prompt used to generate targeted search queries over restricted medical domains. 16 Evidence Synthesis Agent Prompt You are ...

work page
[6]

Scraped Text from Web: {raw_text} Instructions:

work page
[7]

synthesis

Check for Conflicts: Determine whether the retrieved text shows differences or inconsistencies between sources. Record this information in the “synthesis” section

work page
[8]

evidence_sources

Extract Facts (evidence_sources): – Populate the “evidence_sources” list. – For each source, extract a representative “key_excerpt”. – Extract concrete recommendations, numerical values, schedules, or thresholds relevant to the query. – If tables are present (e.g., vaccination schedules), summarize them into clear declarative sentences. Do not reference tables

work page
[9]

synthesis

Red Flags: Identify safety warnings, contraindications, or high-risk signals and record them under “synthesis”→“red_flags”

work page
[10]

ATOMIC FACT

Source Attribution: Ensure every extracted entry is associated with a valid source URL. Output Instruction: – Output ONLY valid JSON. – Do not include conversational text. {format_instructions} Table 10: Evidence Synthesis Agent prompt used to consolidate retrieved sources into structured medical evidence blocks. Medical Fact Agent Prompt — Step 1: Atomic...

work page
[11]

Qualitative Statements: Definitions, descriptions, mechanisms, characteristics, or procedural steps

work page
[12]

Quantitative Data: Specific numbers, measurements, timeframes, dosages, or frequencies

work page
[13]

If X, then Y

Conditional Logic: “If X, then Y” statements or dependency rules. Instructions: – Do not omit or miss information; extract raw information segments. – Deconstruct complex sentences into single, standalone premises. Output JSON: { “positive_atomic_facts”: [ “Fact statement 1”, “Fact statement 2” ], “negative_constraints”: [ “Explicit prohibitions”, “Contra...

work page
[14]

Direct Alignment: Retain facts that directly address the user’s question or stated symptoms

work page
[15]

Contextual Necessity: Retain background definitions required for understanding the answer

work page
[16]

Semantic Relevance: Discard facts related to medical conditions, demographics, or treatments not implied by the user query

work page
[17]

relevant_positive_facts

Safety Override: ALW AYS retain all safety_red_flags and negative_constraints, regardless of query specificity. Output JSON: { “relevant_positive_facts”: [], “relevant_negative_constraints”: [], “relevant_red_flags”: [] } Table 12: Medical Fact Agent prompt (Step 2), used to filter atomic facts according to query relevance and safety-preserving constraint...

work page
[18]

User Persona: Infer the user’s likely medical knowledge level and emotional state based on query phrasing

work page
[19]

Missing Context: Identify medically necessary variables (e.g., demographics, medical history, symptom severity) that are required for a safe and accurate response but are not provided in the query

work page
[20]

user_persona

Tone: Determine the appropriate communication style (e.g., reassuring, neutral, cautious, empathetic). Output JSON: { “user_persona”: “...”, “missing_context_questions”: [ “Question 1”, “Question 2” ], “tone”: “...” } Table 13: Interaction Intent Agent prompt used to infer user persona, missing clinical context, and appropriate response tone for safe dial...

work page
[21]

Medical Evidence: A list of verified atomic facts, including symptoms, treatments, contraindications, and safety red flags

work page
[22]

CONSOLIDATION STRATEGY (Cluster & Enumerate): – Group related medical concepts into coherent evaluation criteria

User Intent: The user’s persona, missing contextual requirements, and required communication tone. CONSOLIDATION STRATEGY (Cluster & Enumerate): – Group related medical concepts into coherent evaluation criteria. – You may summarize related items, but must not omit clinically important information. GENERATION STRATEGY (Holistic Coverage): – Do not merely ...

work page
[23]

– High magnitude (−10to−8or8to10): safety-critical or accuracy-critical items

Score Range: Integer values from−10to10. – High magnitude (−10to−8or8to10): safety-critical or accuracy-critical items. – Medium magnitude (−7to−4or4to7): completeness and contextual coverage. – Low magnitude (−3to−1or1to3): minor details or communication style

work page
[24]

Correctly identifies the recommended dosage of 500 mg

Allowed Axes: – accuracy: factual correctness and safety violations. – completeness: coverage of required topics. – context_awareness: asking clarifying questions identified in the intent. – communication_quality: tone, empathy, and clarity. – instruction_following: formatting or explicit constraints. FORMAT CONSTRAINTS: – Aim for comprehensive coverage w...

work page
[25]

{user_query}

User Query: “{user_query}”

work page
[26]

Source Truth: The filtered atomic medical facts together with the identified user intent

work page
[27]

rubrics”: [ { “criterion

Draft Rubrics: The current set of generated evaluation criteria. AUDIT PROCEDURE: PHASE 1: Gap Analysis and Supplementation (CRITICAL) – Scan the Source Truth, including all symptoms, treatments, safety red flags, and contextual questions. – Check whether each item is covered by the draft rubrics. – Action: If any key fact (e.g., a specific drug, symptom,...

work page

[1] [1]

Guidelines: CDC (site:cdc.gov), WHO (site:who.int), NICE (site:nice.org.uk), Merck Manuals (site:merckmanuals.com)

work page

[2] [2]

Drugs: Drugs.com (site:drugs.com), BNF (site:bnf.nice.org.uk)

work page

[3] [3]

Patient Ed: Mayo Clinic (site:mayoclinic.org), Cleveland Clinic (site:clevelandclinic.org), NHS (site:nhs.uk)

work page

[4] [4]

Research: PubMed (site:ncbi.nlm.nih.gov) Task:

work page

[5] [5]

intent”: “string

Generate 3–5 specific search queries combining medical terms with relevant authoritative sites. IMPORTANT:Output ONLY valid JSON. Example format: {“intent”: “string”, “queries”: [“query1”, “query2”, “query3”]} Table 9: Routing Agent Prompt used to generate targeted search queries over restricted medical domains. 16 Evidence Synthesis Agent Prompt You are ...

work page

[6] [6]

Scraped Text from Web: {raw_text} Instructions:

work page

[7] [7]

synthesis

Check for Conflicts: Determine whether the retrieved text shows differences or inconsistencies between sources. Record this information in the “synthesis” section

work page

[8] [8]

evidence_sources

Extract Facts (evidence_sources): – Populate the “evidence_sources” list. – For each source, extract a representative “key_excerpt”. – Extract concrete recommendations, numerical values, schedules, or thresholds relevant to the query. – If tables are present (e.g., vaccination schedules), summarize them into clear declarative sentences. Do not reference tables

work page

[9] [9]

synthesis

Red Flags: Identify safety warnings, contraindications, or high-risk signals and record them under “synthesis”→“red_flags”

work page

[10] [10]

ATOMIC FACT

Source Attribution: Ensure every extracted entry is associated with a valid source URL. Output Instruction: – Output ONLY valid JSON. – Do not include conversational text. {format_instructions} Table 10: Evidence Synthesis Agent prompt used to consolidate retrieved sources into structured medical evidence blocks. Medical Fact Agent Prompt — Step 1: Atomic...

work page

[11] [11]

Qualitative Statements: Definitions, descriptions, mechanisms, characteristics, or procedural steps

work page

[12] [12]

Quantitative Data: Specific numbers, measurements, timeframes, dosages, or frequencies

work page

[13] [13]

If X, then Y

Conditional Logic: “If X, then Y” statements or dependency rules. Instructions: – Do not omit or miss information; extract raw information segments. – Deconstruct complex sentences into single, standalone premises. Output JSON: { “positive_atomic_facts”: [ “Fact statement 1”, “Fact statement 2” ], “negative_constraints”: [ “Explicit prohibitions”, “Contra...

work page

[14] [14]

Direct Alignment: Retain facts that directly address the user’s question or stated symptoms

work page

[15] [15]

Contextual Necessity: Retain background definitions required for understanding the answer

work page

[16] [16]

Semantic Relevance: Discard facts related to medical conditions, demographics, or treatments not implied by the user query

work page

[17] [17]

relevant_positive_facts

Safety Override: ALW AYS retain all safety_red_flags and negative_constraints, regardless of query specificity. Output JSON: { “relevant_positive_facts”: [], “relevant_negative_constraints”: [], “relevant_red_flags”: [] } Table 12: Medical Fact Agent prompt (Step 2), used to filter atomic facts according to query relevance and safety-preserving constraint...

work page

[18] [18]

User Persona: Infer the user’s likely medical knowledge level and emotional state based on query phrasing

work page

[19] [19]

Missing Context: Identify medically necessary variables (e.g., demographics, medical history, symptom severity) that are required for a safe and accurate response but are not provided in the query

work page

[20] [20]

user_persona

Tone: Determine the appropriate communication style (e.g., reassuring, neutral, cautious, empathetic). Output JSON: { “user_persona”: “...”, “missing_context_questions”: [ “Question 1”, “Question 2” ], “tone”: “...” } Table 13: Interaction Intent Agent prompt used to infer user persona, missing clinical context, and appropriate response tone for safe dial...

work page

[21] [21]

Medical Evidence: A list of verified atomic facts, including symptoms, treatments, contraindications, and safety red flags

work page

[22] [22]

CONSOLIDATION STRATEGY (Cluster & Enumerate): – Group related medical concepts into coherent evaluation criteria

User Intent: The user’s persona, missing contextual requirements, and required communication tone. CONSOLIDATION STRATEGY (Cluster & Enumerate): – Group related medical concepts into coherent evaluation criteria. – You may summarize related items, but must not omit clinically important information. GENERATION STRATEGY (Holistic Coverage): – Do not merely ...

work page

[23] [23]

– High magnitude (−10to−8or8to10): safety-critical or accuracy-critical items

Score Range: Integer values from−10to10. – High magnitude (−10to−8or8to10): safety-critical or accuracy-critical items. – Medium magnitude (−7to−4or4to7): completeness and contextual coverage. – Low magnitude (−3to−1or1to3): minor details or communication style

work page

[24] [24]

Correctly identifies the recommended dosage of 500 mg

Allowed Axes: – accuracy: factual correctness and safety violations. – completeness: coverage of required topics. – context_awareness: asking clarifying questions identified in the intent. – communication_quality: tone, empathy, and clarity. – instruction_following: formatting or explicit constraints. FORMAT CONSTRAINTS: – Aim for comprehensive coverage w...

work page

[25] [25]

{user_query}

User Query: “{user_query}”

work page

[26] [26]

Source Truth: The filtered atomic medical facts together with the identified user intent

work page

[27] [27]

rubrics”: [ { “criterion

Draft Rubrics: The current set of generated evaluation criteria. AUDIT PROCEDURE: PHASE 1: Gap Analysis and Supplementation (CRITICAL) – Scan the Source Truth, including all symptoms, treatments, safety red flags, and contextual questions. – Check whether each item is covered by the draft rubrics. – Action: If any key fact (e.g., a specific drug, symptom,...

work page