{"total":15,"items":[{"citing_arxiv_id":"2606.29034","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The strength of clinical evidence is recoverable from language model representations but not from their stated grades","primary_cat":"cs.CL","submitted_at":"2026-06-27T18:06:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Linear probes recover evidence grades from LLM activations (median AUROC 71.8) across 22 models but the models' stated grades perform at chance level and the signal is largely lexical.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12699","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data","primary_cat":"cs.LG","submitted_at":"2026-06-10T21:39:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GlyLLM applies pre-trained LLMs to integrate CGM sensor data with structured metadata for glucose forecasting and diabetes categorization, reporting 13.66% lower RMSE and 13.08% higher AUROC than traditional ML on the AI-READI dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08071","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-06-06T09:45:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SurgiQ is a new 13k-question surgical benchmark showing general-purpose LLMs reach 68.1% accuracy while most biomedical models lag and 
smaller models stay near random baseline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03399","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Selective Token-Level Cryptographic Redaction for Privacy-Preserving Clinical Deployment of Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-06-02T09:40:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"HERALD selectively encrypts sensitive tokens via medical NER, POS policies, and deterministic ciphertext substitution to enable privacy-preserving clinical LLM use while recovering near-plaintext task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30637","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs","primary_cat":"cs.AI","submitted_at":"2026-05-28T22:38:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EHRBench uses an EHR-LLM-KB pipeline to automatically create 960,067 reliable QA items spanning diagnosis, treatment, and prognosis for large-scale LLM evaluation in clinical decision making.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27860","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-27T02:20:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"C-MIG uses multi-view information gain from 
retrieved documents and refinements to supervise RAG-RL for clinical diagnosis, claiming top performance on four medical benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20537","ref_index":68,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework","primary_cat":"cs.CL","submitted_at":"2026-05-19T22:19:22+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A corpus-centric framework diagnoses scale, structure, overlap, metadata, and terminology properties across nine biomedical NER/EL corpora, showing substantial differences that common statistics fail to capture.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15589","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-15T03:55:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MHGraphBench is a new PrimeKG-derived benchmark that exposes a recognition-to-judgment gap in 15 LLMs on mental health tasks while stressing that results measure KG agreement under constrained interfaces, not clinical capability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12313","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question 
answering","primary_cat":"cs.CL","submitted_at":"2026-05-12T15:59:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MedHopQA introduces a 1,000-question two-hop biomedical QA benchmark where retrieval-augmented systems reach 89% conceptual accuracy, outperforming zero-shot baselines by over 20 points.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Their retrieval pipeline combined semantic vector search with keyword-based BM25S retrieval (46), followed by reranking using reciprocal rank fusion (47). To address multi-hop questions, the system applied structured prompting to decompose each query into simpler sub-questions. These were processed independently by a local Llama3 med42 8B model (48), and the resulting intermediate answers were subsequently combined to produce a concise final response. lasigeBioTM team, Sofia I. R. Conceição and Paulo R. C. Lopes, LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa The lasigeBioTM team (37) developed a RAG-based system using the Mistral-7B-Instruct-v0.3 model (49), augmented with external knowledge from Wikipedia and the Mondo"},{"citing_arxiv_id":"2605.05963","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TheraAgent: Self-Improving Therapeutic Agent for Precise and Comprehensive Treatment Planning","primary_cat":"cs.AI","submitted_at":"2026-05-07T10:10:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TheraAgent uses iterative agentic refinement with an integrated clinical judge to produce more accurate, complete, and safer treatment plans than standard 
LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04882","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FairEnc: A Fair Vision-Language Model with Fair Vision and Text Encoders for Glaucoma Detection","primary_cat":"cs.CV","submitted_at":"2026-05-06T13:18:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FairEnc reduces demographic biases in VLMs for glaucoma detection via LLM-generated synthetic text and dual-level visual debiasing while preserving diagnostic accuracy across datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05435","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CareTransition-Audit: A Benchmark to Audit Discharge Summaries for Efficient Care Transitions","primary_cat":"cs.AI","submitted_at":"2026-04-07T05:04:00+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02501","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ECG Foundation Models and Medical LLMs for Agentic Cardiovascular Intelligence at the Edge: A Review and Outlook","primary_cat":"eess.SP","submitted_at":"2026-04-02T20:09:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"ECG foundation models for signal interpretation and medical LLMs for reasoning can be integrated into agentic systems for real-time cardiovascular intelligence on edge devices.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Köpf, A. 
Mohtashami, et al., \"Meditron-70b: Scaling medical pretraining for large language models,\" arXiv preprint arXiv:2311.16079, 2023. [16] Y. Labrak, A. Bazoge, E. Morin, P.-A. Gourraud, M. Rouvier, and R. Dufour, \"Biomistral: A collection of open-source pre-trained large language models for medical domains,\" arXiv preprint arXiv:2402.10373, 2024. [17] C. Christophe, P. K. Kanithi, T. Raha, S. Khan, and M. A. Pimentel, \"Med42-v2: A suite of clinical llms,\" arXiv preprint arXiv:2408.06142, 2024. [18] H. Yu, P. Guo, and A. Sano, \"Ecg semantic integrator (esi): A foundation ecg model pretrained with llm-enhanced cardiological text,\" arXiv preprint arXiv:2405.19366, 2024. [19] N. Chan, F. Parker, W. Bennett, T."},{"citing_arxiv_id":"2604.08559","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Medical Reasoning with Large Language Models: A Survey and MR-Bench","primary_cat":"cs.CL","submitted_at":"2026-03-17T09:03:09+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.07963","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?","primary_cat":"cs.CL","submitted_at":"2025-02-11T21:21:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Evaluation of 22 LLMs shows they are more susceptible to spin in medical abstracts than humans but can recognize and mitigate it when 
prompted.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}