White-box method ReXTrust achieves highest AUC (peak 93.0) on Gut-VLM across five VLMs, outperforming alternatives by statistically significant margins while black-box and some gray-box methods collapse on certain models.
URL https://arxiv.org/abs/2102.09542
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 10roles
dataset 1polarities
use dataset 1representative citing papers
AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and finds validation and submission as dominant failure points based on thousands of runs.
EchoVQA is the first large-scale VQA dataset for echocardiography spanning high- and low-quality images across views, with acquisition guidance questions, paired with a low-parameter multimodal prompt model that reports SOTA on several benchmarks.
JMed48k is a new benchmark of Japanese healthcare licensing exams used to evaluate 21 VLMs, with a paired image-removal audit revealing large differences in how models and professions benefit from visual content.
GPVAE replaces the standard VAE latent prior with a temporal Gaussian process prior, combined with endoscopy-specific encoders and specular masking, to achieve up to 26.1% lower image reconstruction RMSE on the C3VDv2 colonoscopy dataset.
Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.
MedVIGIL provides a 300-case evaluation suite with 2556 probes that measures silent failures in medical VLMs under broken evidence, showing the best model at 69.2 on the composite score versus a human radiologist at 83.3.
PSIRNet produces diagnostic-quality free-breathing PSIR LGE cardiac MRI from a single interleaved IR/PD acquisition over two heartbeats using a physics-guided deep learning network trained on over 800,000 slices.
A-ROM delivers competitive MedMNIST performance via pretrained ViT metric spaces, a concept dictionary, and kNN without backpropagation or fine-tuning, framed as interpretable few-shot learning under the Platonic Representation Hypothesis.
Denoising autoencoder pretraining on corrupted visual embeddings yields more robust Med-VQA performance on SLAKE and PathVQA while using LoRA for efficient LLM adaptation.
citing papers explorer
No citing papers match the current filters.