Med42-v2: A Suite of Clinical LLMs

2024 · arXiv:2408.06142

15 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models

cs.CL · 2026-06-06 · unverdicted · novelty 7.0

SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and smaller models stay near random baseline.

What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

cs.CL · 2026-05-19 · accept · novelty 7.0

A corpus-centric framework diagnoses scale, structure, overlap, metadata, and terminology properties across nine biomedical NER/EL corpora, showing substantial differences that common statistics fail to capture.

MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models

cs.CL · 2026-05-15 · conditional · novelty 7.0

MHGraphBench is a new PrimeKG-derived benchmark that exposes a recognition-to-judgment gap in 15 LLMs on mental health tasks while stressing that results measure KG agreement under constrained interfaces, not clinical capability.

Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

MedHopQA introduces a 1,000-question two-hop biomedical QA benchmark where retrieval-augmented systems reach 89% conceptual accuracy, outperforming zero-shot baselines by over 20 points.

Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?

cs.CL · 2025-02-11 · unverdicted · novelty 7.0

Evaluation of 22 LLMs shows they are more susceptible to spin in medical abstracts than humans but can recognize and mitigate it when prompted.

The strength of clinical evidence is recoverable from language model representations but not from their stated grades

cs.CL · 2026-06-27 · unverdicted · novelty 6.0

Linear probes recover evidence grades from LLM activations (median AUROC 71.8) across 22 models but the models' stated grades perform at chance level and the signal is largely lexical.

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

EHRBench uses an EHR-LLM-KB pipeline to automatically create 960,067 reliable QA items spanning diagnosis, treatment, and prognosis for large-scale LLM evaluation in clinical decision making.

FairEnc: A Fair Vision-Language Model with Fair Vision and Text Encoders for Glaucoma Detection

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

FairEnc reduces demographic biases in VLMs for glaucoma detection via LLM-generated synthetic text and dual-level visual debiasing while preserving diagnostic accuracy across datasets.

C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning

cs.AI · 2026-05-27 · unverdicted · novelty 5.0

C-MIG uses multi-view information gain from retrieved documents and refinements to supervise RAG-RL for clinical diagnosis, claiming top performance on four medical benchmarks.

TheraAgent: Self-Improving Therapeutic Agent for Precise and Comprehensive Treatment Planning

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

TheraAgent uses iterative agentic refinement with an integrated clinical judge to produce more accurate, complete, and safer treatment plans than standard LLMs.

Medical Reasoning with Large Language Models: A Survey and MR-Bench

cs.CL · 2026-03-17 · accept · novelty 5.0

LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.

LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data

cs.LG · 2026-06-10 · unverdicted · novelty 4.0

GlyLLM applies pre-trained LLMs to integrate CGM sensor data with structured metadata for glucose forecasting and diabetes categorization, reporting 13.66% lower RMSE and 13.08% higher AUROC than traditional ML on the AI-READI dataset.

Selective Token-Level Cryptographic Redaction for Privacy-Preserving Clinical Deployment of Large Language Models

cs.CL · 2026-06-02 · unverdicted · novelty 4.0

HERALD selectively encrypts sensitive tokens via medical NER, POS policies, and deterministic ciphertext substitution to enable privacy-preserving clinical LLM use while recovering near-plaintext task performance.

ECG Foundation Models and Medical LLMs for Agentic Cardiovascular Intelligence at the Edge: A Review and Outlook

eess.SP · 2026-04-02 · unverdicted · novelty 3.0

ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.

CareTransition-Audit: A Benchmark to Audit Discharge Summaries for Efficient Care Transitions

cs.AI · 2026-04-07
