Pith Number

pith:VQ43UB3R

pith:2026:VQ43UB3RG4BKYMZRD773M6TI6O

not attested · not anchored · not stored · refs resolved

Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems

Abdine Maiga, Emine Yilmaz, Hossein A. Rahmani, Yinzhu Chen

A retrieval-augmented multi-agent system automatically generates instance-specific rubrics that ground medical dialogue evaluation in verifiable clinical facts.

arxiv:2601.15161 v2 · 2026-01-21 · cs.CL · cs.AI

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{VQ43UB3RG4BKYMZRD773M6TI6O}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
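The page does not describe the merge algorithm itself. As a rough illustration of what "deterministic" means here, the sketch below folds events into a state in a fixed total order, so any mirror that holds the same events computes the same result regardless of arrival order. The event fields (`id`, `ts`, `field`, `value`), the ordering key, and both function names are hypothetical, and signature checking is omitted.

```python
import hashlib
import json


def merge_events(canonical_record, events):
    """Fold events into a state using a fixed total order (hypothetical scheme)."""
    # Assumed ordering key: timestamp first, event id as tie-breaker.
    ordered = sorted(events, key=lambda e: (e["ts"], e["id"]))
    state = {"record": canonical_record, "fields": {}}
    seen = set()
    for event in ordered:
        if event["id"] in seen:
            # The same event fetched from several mirrors is applied once.
            continue
        seen.add(event["id"])
        state["fields"][event["field"]] = event["value"]
    return state


def state_digest(state):
    """Hash the merged state so two mirrors can compare results cheaply."""
    payload = json.dumps(state, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Illustrative inputs only; these are not real events from the bundle.
record = {"source": {"kind": "arxiv", "id": "2601.15161", "version": 2}}
events = [
    {"id": "e2", "ts": "2026-05-19T00:00:00Z", "field": "citations", "value": "open"},
    {"id": "e1", "ts": "2026-05-18T00:00:00Z", "field": "author_claim", "value": "open"},
]

shuffled = list(reversed(events)) + events  # different order, with duplicates
print(state_digest(merge_events(record, events)) == state_digest(merge_events(record, shuffled)))
```

Sorting before folding is what removes the dependence on arrival order; a real implementation would also need to verify each event's signature before applying it.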

Claims

C1 · strongest claim

our framework achieves Clinical Intent Alignment (CIA) scores of 50.20% and 31.90%, significantly outperforming the GPT-4o baseline and demonstrating robust cross-lingual generalization. In discriminative tests on HealthBench, our rubrics yield a 7.8% higher win rate than the GPT-4o baseline, with nearly double the score Δ.

C2 · weakest assumption

that retrieved authoritative medical content can be reliably decomposed into atomic facts and synthesized with interaction constraints to produce verifiable, fine-grained criteria without introducing new errors or hallucinations.

C3 · one-line summary

A retrieval-augmented multi-agent system creates evidence-based, fine-grained rubrics for medical LLM evaluation, achieving 50.20% and 31.90% CIA scores on HealthBench and LLMEval-Med while outperforming GPT-4o baselines.

References

27 extracted · 27 resolved · 0 Pith anchors

Formal links

2 machine-checked theorem links

Receipt and verification

First computed	2026-05-18T02:45:06.027939Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

ac39ba07713702ac33311fffb67a68f3b11af11f8cb1c5faf6d6dce0c5a237b7

Aliases

arxiv: 2601.15161 · arxiv_version: 2601.15161v2 · doi: 10.48550/arxiv.2601.15161 · pith_short_12: VQ43UB3RG4BK · pith_short_16: VQ43UB3RG4BKYMZR · pith_short_8: VQ43UB3R
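The page does not state how the identifier is derived, but the values shown are consistent with the Pith Number being the first 26 characters of the RFC 4648 Base32 encoding of the canonical SHA-256 digest, with the short aliases being 8-, 12-, and 16-character prefixes. The sketch below checks that observation against the values on this page; `pith_from_hash` is a hypothetical helper name, not part of any published Pith tooling.

```python
import base64

# Canonical hash and identifier copied from this record.
CANONICAL_HASH = "ac39ba07713702ac33311fffb67a68f3b11af11f8cb1c5faf6d6dce0c5a237b7"
PITH_NUMBER = "VQ43UB3RG4BKYMZRD773M6TI6O"


def pith_from_hash(hex_digest, length=26):
    """Base32-encode the raw digest bytes and keep the first `length` characters."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


print(pith_from_hash(CANONICAL_HASH))      # full identifier
print(pith_from_hash(CANONICAL_HASH, 16))  # pith_short_16
print(pith_from_hash(CANONICAL_HASH, 12))  # pith_short_12
print(pith_from_hash(CANONICAL_HASH, 8))   # pith_short_8
```

If this relationship holds in general, a reader who has recomputed the canonical hash can confirm the identifier offline without contacting the resolver.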

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/VQ43UB3RG4BKYMZRD773M6TI6O \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: ac39ba07713702ac33311fffb67a68f3b11af11f8cb1c5faf6d6dce0c5a237b7
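For readers who prefer a script to a shell pipeline, the same check can be sketched in Python. It mirrors the canonicalization used in the one-liner above (sorted keys, compact separators, UTF-8 without ASCII escaping); the function names `canonical_sha256` and `verify` are hypothetical, and the record is assumed to have been fetched and parsed already.

```python
import hashlib
import json

# Expected canonical hash, copied from this record.
EXPECTED = "ac39ba07713702ac33311fffb67a68f3b11af11f8cb1c5faf6d6dce0c5a237b7"


def canonical_sha256(record):
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def verify(record, expected=EXPECTED):
    """Return True when the recomputed hash matches the published one."""
    return canonical_sha256(record) == expected


# Key order in the input does not affect the hash, which is the point of
# canonicalizing before hashing.
a = {"schema_version": "1.0", "source": {"id": "2601.15161", "kind": "arxiv", "version": 2}}
b = {"source": {"version": 2, "kind": "arxiv", "id": "2601.15161"}, "schema_version": "1.0"}
print(canonical_sha256(a) == canonical_sha256(b))
```

To check a real record, pass the parsed `canonical_record` object from the resolver response to `verify`.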

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "e33b3174929e4238e0e315c59977f1cfa46ec8ec91de621c06cfb178827ae9f4",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-01-21T16:40:41Z",
    "title_canon_sha256": "71ca9f7437f9cd29e19f2b77486d1eea22c0b7d054bc77417e736b2b2cba09dc"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2601.15161",
    "kind": "arxiv",
    "version": 2
  }
}