SurgiQ is a new 13k-question surgical benchmark on which general-purpose LLMs reach 68.1% accuracy, while most biomedical models lag behind and smaller models stay near the random baseline.
arXiv preprint arXiv:2404.00376
2 Pith papers cite this work. Polarity classification is still indexing.