SurgiQ is a new 13k-question surgical benchmark showing that general-purpose LLMs reach 68.1% accuracy, while most biomedical models lag and smaller models stay near the random baseline.
Med42-v2: A Suite of Clinical LLMs
15 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
A corpus-centric framework diagnoses scale, structure, overlap, metadata, and terminology properties across nine biomedical NER/EL corpora, showing substantial differences that common statistics fail to capture.
MHGraphBench is a new PrimeKG-derived benchmark that exposes a recognition-to-judgment gap in 15 LLMs on mental health tasks while stressing that results measure KG agreement under constrained interfaces, not clinical capability.
MedHopQA introduces a 1,000-question two-hop biomedical QA benchmark where retrieval-augmented systems reach 89% conceptual accuracy, outperforming zero-shot baselines by over 20 points.
Evaluation of 22 LLMs shows they are more susceptible to spin in medical abstracts than humans but can recognize and mitigate it when prompted.
Linear probes recover evidence grades from LLM activations (median AUROC 71.8) across 22 models, but the models' stated grades perform at chance level and the signal is largely lexical.
EHRBench uses an EHR-LLM-KB pipeline to automatically create 960,067 reliable QA items spanning diagnosis, treatment, and prognosis for large-scale LLM evaluation in clinical decision making.
FairEnc reduces demographic biases in VLMs for glaucoma detection via LLM-generated synthetic text and dual-level visual debiasing while preserving diagnostic accuracy across datasets.
C-MIG uses multi-view information gain from retrieved documents and refinements to supervise RAG-RL for clinical diagnosis, claiming top performance on four medical benchmarks.
TheraAgent uses iterative agentic refinement with an integrated clinical judge to produce more accurate, complete, and safer treatment plans than standard LLMs.
LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.
GlyLLM applies pre-trained LLMs to integrate CGM sensor data with structured metadata for glucose forecasting and diabetes categorization, reporting 13.66% lower RMSE and 13.08% higher AUROC than traditional ML on the AI-READI dataset.
HERALD selectively encrypts sensitive tokens via medical NER, POS policies, and deterministic ciphertext substitution to enable privacy-preserving clinical LLM use while recovering near-plaintext task performance.
ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.