CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms
Pith reviewed 2026-05-21 00:47 UTC · model grok-4.3
The pith
Open-weight LLMs reach only 53 percent success on rodent behavioral tasks in a shared zero-shot ASCII protocol, well below the 79 percent approximate animal baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a unified zero-shot ASCII protocol with no task-specific instructions, six open-weight LLMs from 3B to 72B parameters average well below approximate rodent baselines across nine classical behavioral neuroscience paradigms. The strongest model reaches 52.6 percent success compared with 78.9 percent for rodents and 32.1 percent for random agents. The same model’s score can swing from 20 to 57 percent depending only on interface parameters such as context length and prompting style, showing that the agent-plus-interface system rather than the isolated model is being measured.
What carries the argument
The unified zero-shot ASCII protocol in which agents receive only a generic system prompt and must discover task goals from text observations and scalar rewards without any task-specific instructions.
If this is right
- Scaling model size past 7 billion parameters yields diminishing or no returns on these spatial and memory tasks.
- Increasing context history length tends to lower rather than raise success rates.
- Chain-of-thought prompting reduces average performance compared with direct output.
- Vision-language architectures improve results at 7B scale but degrade them at 32B scale.
- Spatial navigation and within-trial state tracking remain the clearest areas of underperformance relative to rodent references.
Where Pith is reading between the lines
- The benchmark isolates interface effects from raw model capacity, suggesting future work should standardize or ablate rendering choices before claiming intrinsic limits.
- If performance improves on these abstracted tasks, the gains could indicate better internal world models that transfer to planning or robotics domains.
- Task-by-task error patterns might map onto specific deficits such as mental rotation or working memory that parallel known computational neuroscience accounts.
- Extending the protocol to include richer sensory channels could test whether current gaps stem from text poverty rather than reasoning shortfalls.
Load-bearing premise
ASCII text renderings and reward signals supply a fair proxy for the cognitive demands of the physical rodent paradigms.
What would settle it
Run the identical LLM agent on a pixel- or physics-based simulation of the same mazes and chambers and measure whether the performance gap to rodent baselines shrinks or stays the same.
Figures
read the original abstract
We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task-specific instructions and must discover goals purely from ASCII text observations and reward signals, much like a rodent placed into an unfamiliar apparatus. We evaluate six open-weight LLMs (3B to 72B parameters) on text-based ASCII renderings and compare against both a random baseline and a graph-based reinforcement learning agent. Our best model (Qwen2.5-VL-7B) reaches 52.6% average success on ASCII input, compared to 32.1% for random agents and 78.9% for approximate rodent baselines. We find that (1) scaling beyond 7B yields diminishing returns, (2) longer context history degrades performance, (3) chain-of-thought prompting hurts rather than helps, and (4) a vision-language architecture provides an advantage at 7B but hurts at 32B. Because the same model's performance ranges from 20% to 57% depending on interface parameters alone, these results characterize the agent-plus-interface system, not the model in isolation. Under this unified zero-shot ASCII protocol, current open-weight LLM agents remain well below approximate rodent reference values, particularly on tasks requiring spatial navigation and within-trial state tracking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CheeseBench, a benchmark evaluating six open-weight LLMs (3B–72B) on nine rodent behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, delayed non-match to sample) rendered as ASCII text. Agents receive a single unified zero-shot prompt and must infer goals from text observations and reward signals. Results show the best model (Qwen2.5-VL-7B) at 52.6% average success versus 32.1% random and 78.9% approximate rodent baselines, with additional findings that scaling beyond 7B yields diminishing returns, longer context degrades performance, chain-of-thought hurts, and vision-language architectures show mixed effects. Performance varies 20–57% with interface parameters alone.
Significance. If the ASCII proxy adequately captures the targeted cognitive dimensions, the work supplies a reproducible, unified protocol for testing LLM spatial navigation and state-tracking abilities against external animal baselines. The explicit documentation of interface sensitivity is a strength, as is the direct empirical measurement against both random and rodent references without fitted parameters or circular derivations. This contributes concrete data on where current open-weight models fall short relative to biological systems on these tasks.
major comments (2)
- [Task Design and Rendering] The central claim that LLMs remain below rodent baselines on navigation and state-tracking tasks rests on the assumption that the chosen ASCII renderings impose equivalent demands to the physical paradigms. For Morris water maze and radial arm maze, rodents integrate continuous distal cues and path integration; the discrete grid or symbolic text may permit explicit deduction or pattern matching instead. The manuscript notes 20–57% variation with interface parameters but does not quantify how far the selected ASCII format deviates from the sensory statistics that render the original tasks difficult. This directly affects whether the 52.6% vs 78.9% gap can be attributed primarily to model capability.
- [Abstract and Results] Abstract and §4 (Results): Success rates are presented without error bars, exact trial-by-trial success criteria, or details on how ASCII observations and reward signals are procedurally generated for each paradigm. These omissions leave the reported averages only partially verifiable and weaken assessment of whether the gap to rodent baselines is statistically robust.
minor comments (2)
- [Results] Table 1 or equivalent results table: Include standard deviations or confidence intervals alongside mean success rates to allow readers to gauge variability across runs or seeds.
- [Figures] Figure captions for ASCII examples: Add explicit scale or grid size information so readers can assess information density relative to the physical apparatus dimensions described in the rodent protocols.
Circularity Check
No significant circularity
full rationale
The paper introduces CheeseBench as an empirical benchmark and reports direct performance measurements of LLMs on ASCII-rendered rodent tasks, compared against random baselines, a graph-based RL agent, and approximate rodent reference values drawn from peer-reviewed literature. No derivations, equations, fitted parameters, or first-principles predictions are present; success rates are measured outcomes rather than quantities defined or forced by the paper's own constructs. The central claim (LLMs below rodent baselines under the unified zero-shot ASCII protocol) rests on external animal data and observed model outputs, not on self-referential reductions or self-citation chains that would make the result tautological. Interface sensitivity is explicitly noted as a finding rather than hidden or redefined. This is a standard empirical evaluation with no load-bearing steps that collapse to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption ASCII text renderings and reward signals sufficiently proxy the cognitive dimensions of the physical rodent paradigms
- domain assumption Approximate rodent baselines provide a meaningful reference for LLM performance
Forward citations
Cited by 2 Pith papers
-
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...
-
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.
Reference graph
Works this paper leans on
-
[1]
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Abdin, M., Awan, A. A., Baldassini, L., et al. (2025). Phi-4-multimodal technical report.arXiv preprint arXiv:2503.01743. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zit- nick, C. L., and Parikh, D. (2015). VQA: Visual ques- tion answering. InProceedings of the IEEE Interna- tional Conference on Computer Vision, pages 2425–
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. (2025). Qwen2.5-VL: Enhancing vision-language model’s per- ception of the world at any resolution.arXiv preprint arXiv:2502.13923. Chac´on-Fern´andez, P., S´anchez-Campusano, R., Gruart, A., and Delgado-Garc´ıa, J. M. (2016). Long-term tread- mill exerci...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Lu, P., Banerjee, H., et al. (2024). MathVista: Evaluating mathematical reasoning of foundation models in visual contexts.Proceedings of the International Conference on Learning Representations. O’Keefe, J. and Dostrovsky, J. (1971). The hippocampus as a spatial map: preliminary evidence from unit activity in the freely-moving rat.Brain Research, 34(1):171–
work page 2024
-
[4]
Olton, D. S. and Samuelson, R. J. (1976). Remembrance of places passed: Spatial memory in rats.Journal of Experimental Psychology: Animal Behavior Processes, 2(2):97–116. Shoji, H., Hagihara, H., Takao, K., Hattori, S., and Miyakawa, T. (2012). T-maze forced alternation and left-right discrimination tasks for assessing working and reference memory in mice...
work page 1976
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.