Overalignment in Frontier LLMs: An Empirical Study of Sycophantic Behaviour in Healthcare, January 2026
2 Pith papers cite this work. Polarity classification is still indexing.
Citation summary
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Roles: background (1)
Polarities: support (1)
Representative citing papers:
A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.
citing papers explorer
- MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge
MRI-Eval benchmark shows frontier LLMs scoring 93-97% on MRI MCQs but falling to 37-61% on stem-only questions, with GE scanner operations as the weakest category for all models.
- Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models
A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.