arXiv preprint arXiv:2404.00376, 2024

Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks · 2024 · arXiv 2404.00376

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models

cs.CL · 2026-06-06 · unverdicted · novelty 7.0

SurgiQ is a new 13k-question surgical benchmark showing that general-purpose LLMs reach 68.1% accuracy, while most biomedical models lag and smaller models stay near the random baseline.

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

cs.AI · 2026-06-02

citing papers explorer

Showing 2 of 2 citing papers.

SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models

cs.CL · 2026-06-06 · unverdicted · none · ref 68

SurgiQ is a new 13k-question surgical benchmark showing that general-purpose LLMs reach 68.1% accuracy, while most biomedical models lag and smaller models stay near the random baseline.

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

cs.AI · 2026-06-02 · unreviewed · ref 16
