Clinical reasoning graphs reveal that LLMs exhibit diagnostic competence on complex cases but lack consistent schema-scale reasoning patterns across similar cases.
hub Mixed citations
HealthBench: Evaluating Large Language Models Towards Improved Human Health
Mixed citation behavior. Most common role is background (53%).
abstract
We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo's 16% to GPT-4o's 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behavior validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Presents the first fully open pipeline for clinical LLMs by unifying eight public QA datasets with three clinician-vetted synthetic extensions and applying it to five base models to achieve benchmark gains while maintaining auditability.
LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
The paper presents EMPATH, a new multilingual multi-turn benchmark for safety evaluation of emotional-support chatbots that uses separate auditor and judge models and releases its pipeline and rubrics.
Releases mamabench (25,949 QA items from seven expert sources) and mamaretrieval (3,185 graded queries over 63,650 chunks) to evaluate RAG in maternal, neonatal, and reproductive health.
IMCBench is a new benchmark for image-grounded multi-turn medical conversations that evaluates eight multimodal LLMs on safety, accuracy, and uncertainty, finding Claude Opus highest overall but safety drops for malignant and rare conditions.
RGSD distills rubric-conditioned teacher distributions into base policies token-by-token, matching GRPO rubric satisfaction on Qwen models with one rollout and zero verifier calls.
A new benchmark and clean-room harness show frontier AI agents reach only 0.337 factual F1 when synthesizing conclusions from scientific evidence.
SkeMex distills agent trajectories into value-aware skills organized in general/task/action branches and evolves them via a closed-loop Read-Write-Assess-Govern process, outperforming prior memory agents on clinical tasks.
MedSP1000 benchmark shows top LLMs complete at most 60.4% of expert rubric items during multi-turn standardized patient simulations.
BigFinanceBench is a workflow-grounded benchmark of 928 financial research tasks with point-weighted rubrics, where the best of ten tested agents scores 58.8% on derivation quality.
PaSBench-Video benchmark shows no tested MLLM exceeds 20% on strict proactive safety metrics, with recall correlated 0.64 to false-positive rate on safe clips.
AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and finds validation and submission as dominant failure points based on thousands of runs.
DDX-TRACE is a physician-adjudicated benchmark for evaluating VLMs on evidence-supported diagnostic trajectories rather than final answers alone in multimodal neuroradiology.
MHGraphBench is a new PrimeKG-derived benchmark that exposes a recognition-to-judgment gap in 15 LLMs on mental health tasks while stressing that results measure KG agreement under constrained interfaces, not clinical capability.
RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.
EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance and up to 10x better sample efficiency than logit-based approaches.
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.
LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.
citing papers explorer
-
Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.