Fetal-Gauge benchmark shows state-of-the-art vision-language models reach only 55% accuracy on fetal ultrasound tasks, well below clinical needs and highlighting the requirement for domain-adapted models.
Aya vision: Advancing the frontier of multilingual multimodality
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2025 2verdicts
UNVERDICTED 2representative citing papers
SHREC is a new benchmark dataset of embodied human-robot conversations that shows substantial performance gaps in state-of-the-art foundation models on tasks involving social error detection and rationale generation.
citing papers explorer
-
FETAL-GAUGE: A Benchmark for Assessing Vision-Language Models in Fetal Ultrasound
Fetal-Gauge benchmark shows state-of-the-art vision-language models reach only 55% accuracy on fetal ultrasound tasks, well below clinical needs and highlighting the requirement for domain-adapted models.
-
Social Human Robot Embodied Conversation (SHREC) Dataset: Benchmarking Foundational Models' Social Reasoning
SHREC is a new benchmark dataset of embodied human-robot conversations that shows substantial performance gaps in state-of-the-art foundation models on tasks involving social error detection and rationale generation.