CheeseBench is a benchmark in which LLMs act as zero-shot agents in text-rendered versions of classical rodent experiments. The best model reaches 52.6% success, compared with a 32.1% random baseline and an approximate rodent baseline of 78.9%.
Field: cs.AI. Year: 2026.
CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms