X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.
Llava-med: Training a large language- and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564,
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
SurgCoT is a new benchmark that evaluates chain-of-thought spatiotemporal reasoning in multimodal large language models on surgical videos using five defined dimensions and an annotation protocol of Question-Option-Knowledge-Clue-Answer.
PBS-VL trained on the new PBSInstr dataset outperforms general and pathology MLLMs on the PBSBench VQA tasks for hematopathology.
IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.
CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.
EchoAgent is a new agentic AI system that integrates visual observation, quantitative measurement, and expert knowledge reasoning to achieve reliable echocardiography interpretation with up to 80% accuracy on CAMUS and MIMIC-EchoQA datasets.
Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
MedGRPO applies cross-dataset reward normalization and a clinical LLM judge within multi-task RL to improve vision-language models on heterogeneous medical video understanding tasks using the new MedVidBench dataset.
fMRI-LM builds a foundation model that aligns fMRI signals with language through tokenization, LLM adaptation, and instruction tuning to enable semantic understanding of brain activity.
citing papers explorer
-
X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.
-
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
SurgCoT is a new benchmark that evaluates chain-of-thought spatiotemporal reasoning in multimodal large language models on surgical videos using five defined dimensions and an annotation protocol of Question-Option-Knowledge-Clue-Answer.
-
PBSBench: A Multi-Level Vision-Language Framework and Benchmark for Hematopathology Whole Slide Image Interpretation
PBS-VL trained on the new PBSInstr dataset outperforms general and pathology MLLMs on the PBSBench VQA tasks for hematopathology.
-
IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.
-
CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging
CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.
-
EchoAgent: Towards Reliable Echocardiography Interpretation with "Eyes","Hands" and "Minds"
EchoAgent is a new agentic AI system that integrates visual observation, quantitative measurement, and expert knowledge reasoning to achieve reliable echocardiography interpretation with up to 80% accuracy on CAMUS and MIMIC-EchoQA datasets.
-
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
-
MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
MedGRPO applies cross-dataset reward normalization and a clinical LLM judge within multi-task RL to improve vision-language models on heterogeneous medical video understanding tasks using the new MedVidBench dataset.
-
fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
fMRI-LM builds a foundation model that aligns fMRI signals with language through tokenization, LLM adaptation, and instruction tuning to enable semantic understanding of brain activity.