CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms
Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3
The pith
Open-weight LLMs reach only about 53 percent average success on ASCII versions of classic rodent tasks, while the animals themselves average roughly 79 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CheeseBench evaluates LLMs on nine peer-reviewed rodent paradigms rendered as unified zero-shot ASCII text environments. The best evaluated model, Qwen2.5-VL-7B, reaches 52.6 percent average success, against approximate rodent baselines of 78.9 percent and a random-agent baseline of 32.1 percent. Performance is weakest on tasks requiring spatial navigation and within-trial state tracking; scaling beyond 7B parameters yields diminishing returns, longer context history degrades performance, and chain-of-thought prompting hurts rather than helps. A vision-language architecture helps at 7B but hurts at 32B. Because interface parameters alone move the same model between 20 and 57 percent, the results characterize the full agent-plus-interface system rather than isolated model capability.
What carries the argument
CheeseBench, a unified zero-shot ASCII protocol that presents rodent tasks as text observations and reward signals without any task-specific instructions.
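This protocol can be sketched as a minimal observation-and-reward loop. The environment below (`TMazeEnv`, its `render`/`step` methods, and the `{'L', 'R'}` action set) is an illustrative assumption, not CheeseBench's actual API; the point is only the interface: the agent sees an ASCII grid plus a scalar reward and receives no task-specific instructions.

```python
import random

# Minimal sketch of a zero-shot ASCII protocol. TMazeEnv and its action set
# are illustrative assumptions, not CheeseBench's actual implementation.

class TMazeEnv:
    """Toy forced-alternation T-maze: the rewarded arm flips every trial."""

    def __init__(self, seed=0):
        self.rewarded_arm = random.Random(seed).choice(["L", "R"])

    def render(self):
        # The only observation the agent ever sees: an ASCII grid with the
        # agent '@' at the choice point and arms 'L' / 'R'.
        return ("#########\n"
                "#L  @  R#\n"
                "#### ####")

    def step(self, action):
        """action in {'L', 'R'}; returns (next observation, reward)."""
        reward = 1.0 if action == self.rewarded_arm else 0.0
        # Forced alternation: the other arm is rewarded on the next trial.
        self.rewarded_arm = "L" if self.rewarded_arm == "R" else "R"
        return self.render(), reward

def random_agent(observation):
    # Stand-in for the LLM: maps an ASCII observation to an action.
    return random.choice(["L", "R"])

if __name__ == "__main__":
    env = TMazeEnv()
    trials, successes = 1000, 0.0
    for _ in range(trials):
        obs = env.render()
        obs, reward = env.step(random_agent(obs))
        successes += reward
    print(f"random-agent success rate: {successes / trials:.2f}")  # near 0.50
```

An LLM agent would replace `random_agent`, receiving the rendered grid and reward history in its prompt; everything it knows about the task must be inferred from this loop, which is why the paper reads its scores as properties of the agent-plus-interface system.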
Load-bearing premise
The ASCII text renderings of the tasks accurately capture the core cognitive and perceptual demands of the original rodent behavioral paradigms without introducing artifacts that alter difficulty.
What would settle it
An open-weight LLM achieving approximately 79 percent average success across the nine ASCII tasks under the same zero-shot protocol would directly contradict the central claim.
Original abstract
We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task-specific instructions and must discover goals purely from ASCII text observations and reward signals, much like a rodent placed into an unfamiliar apparatus. We evaluate six open-weight LLMs (3B to 72B parameters) on text-based ASCII renderings and compare against both a random baseline and a graph-based reinforcement learning agent. Our best model (Qwen2.5-VL-7B) reaches 52.6% average success on ASCII input, compared to 32.1% for random agents and 78.9% for approximate rodent baselines. We find that (1) scaling beyond 7B yields diminishing returns, (2) longer context history degrades performance, (3) chain-of-thought prompting hurts rather than helps, and (4) a vision-language architecture provides an advantage at 7B but hurts at 32B. Because the same model's performance ranges from 20% to 57% depending on interface parameters alone, these results characterize the agent-plus-interface system, not the model in isolation. Under this unified zero-shot ASCII protocol, current open-weight LLM agents remain well below approximate rodent reference values, particularly on tasks requiring spatial navigation and within-trial state tracking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CheeseBench, a benchmark evaluating six open-weight LLMs (3B–72B) on nine rodent behavioral neuroscience tasks (Morris water maze, Barnes maze, T-maze, etc.) rendered as unified zero-shot ASCII text environments. Agents receive only a general system prompt and must infer goals from observations and reward signals. The best model (Qwen2.5-VL-7B) achieves 52.6% average success versus 32.1% random and 78.9% approximate rodent baselines. Additional findings include diminishing returns beyond 7B, negative effects from longer context and chain-of-thought, and interface sensitivity; the central conclusion is that current LLM agents remain well below rodent levels, especially on spatial navigation and within-trial state tracking.
Significance. If the rodent baselines prove comparable and the ASCII renderings preserve core demands, the benchmark offers a reproducible, multi-task protocol for probing LLM cognitive capabilities against animal references, with explicit controls for random and graph-RL agents. The empirical scaling, prompting, and architecture ablations are useful for the field. The work's value is reduced by the lack of transparent derivation for the 78.9% rodent figures and limited validation that text interfaces equate to physical task difficulty.
major comments (2)
- [Abstract] Abstract: The central claim that LLMs 'remain well below approximate rodent reference values' (78.9% average) is load-bearing for the headline result and interpretation of the performance gap. However, no equation, table, or section details how these rodent percentages were obtained (first-trial naïve performance, multi-session averages, trial counts, or adjustments for loss of olfactory/visual cues in the ASCII zero-shot setting). This leaves the gap sensitive to unstated assumptions about baseline equivalence.
- [Task implementation] Task implementation section: The assumption that ASCII renderings accurately capture the cognitive and perceptual demands of the original paradigms (e.g., spatial navigation in Morris water maze or state tracking in delayed non-match to sample) without interface artifacts is central to attributing results to model limitations rather than protocol mismatch. No explicit validation, mapping details, or sensitivity analysis to rendering choices is provided to support this equivalence.
minor comments (2)
- [Results] Results: The statement that performance 'ranges from 20% to 57% depending on interface parameters alone' would be strengthened by a table listing the exact parameters varied and the corresponding scores for the best model.
- [Abstract] Abstract: Clarify the precise parameter counts and names of all six evaluated models to allow direct replication of the scaling and architecture comparisons.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the major comments point by point below, indicating where revisions will be made to improve transparency.
Point-by-point responses
Referee: [Abstract] Abstract: The central claim that LLMs 'remain well below approximate rodent reference values' (78.9% average) is load-bearing for the headline result and interpretation of the performance gap. However, no equation, table, or section details how these rodent percentages were obtained (first-trial naïve performance, multi-session averages, trial counts, or adjustments for loss of olfactory/visual cues in the ASCII zero-shot setting). This leaves the gap sensitive to unstated assumptions about baseline equivalence.
Authors: We agree that greater transparency on the rodent baselines is needed. These approximate values (averaging 78.9%) were compiled from peer-reviewed rodent studies for each of the nine tasks, prioritizing naïve or early-trial performance metrics to best align with the zero-shot protocol. Adjustments for missing olfactory and visual cues were implicit in selecting text-compatible metrics rather than explicitly quantified. In revision we will add an appendix or dedicated subsection listing the source papers, specific performance figures, trial conditions, and any assumptions for each task, allowing readers to evaluate the comparison directly. revision: yes
Referee: [Task implementation] Task implementation section: The assumption that ASCII renderings accurately capture the cognitive and perceptual demands of the original paradigms (e.g., spatial navigation in Morris water maze or state tracking in delayed non-match to sample) without interface artifacts is central to attributing results to model limitations rather than protocol mismatch. No explicit validation, mapping details, or sensitivity analysis to rendering choices is provided to support this equivalence.
Authors: We acknowledge the importance of clarifying the mapping between physical paradigms and ASCII environments. The manuscript already reports that the same model’s success rate varies from 20% to 57% solely due to interface parameter changes, which supports interpreting results as characterizing the agent-plus-interface system rather than isolated model capability. In the revised version we will expand the Task Implementation section with explicit rendering details and mappings for each task (e.g., how spatial layouts and state information are encoded in ASCII), citing the original rodent protocols. While a full empirical equivalence validation would require new physical experiments outside the current scope, we will add discussion of potential artifacts and their implications for the observed LLM–rodent gap. revision: partial
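As a concrete illustration of the kind of rendering mapping the response promises, the sketch below shows one hypothetical way interface parameters alone (glyph choice, coordinate labels) change the surface form of an identical T-maze state. The `render` function and its parameters are assumptions for illustration, not the paper's actual renderer.

```python
# Hypothetical renderer: the underlying task state (wall layout, agent and
# arm positions) is fixed; only interface parameters vary.
def render(grid, wall_char="#", agent_char="@", with_coords=False):
    rows = []
    for r, row in enumerate(grid):
        line = "".join(
            wall_char if c == "#" else agent_char if c == "@" else c
            for c in row
        )
        if with_coords:
            line = f"{r} {line}"  # prepend a row index as an extra cue
        rows.append(line)
    return "\n".join(rows)

# One T-maze state: agent '@' at the choice point, arms 'L' and 'R'.
tmaze = ["#########",
         "#L  @  R#",
         "#### ####"]

print(render(tmaze))                                            # default interface
print(render(tmaze, wall_char="+", agent_char="x", with_coords=True))
```

Both renderings encode the same state, yet the model's tokenizer and attention receive different inputs; sensitivity of success rates to exactly such choices is what motivates reading the results as characterizing the agent-plus-interface system.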
Circularity Check
No circularity: purely empirical benchmark with external literature baselines
Full rationale
The paper introduces CheeseBench as an empirical evaluation of LLMs on ASCII-rendered rodent tasks, reporting observed success rates (e.g., 52.6% for Qwen2.5-VL-7B) against random baselines and approximate rodent values drawn from peer-reviewed protocols. No equations, fitted parameters, predictions, or derivations appear in the provided text. The rodent baselines are explicitly labeled 'approximate' and sourced externally rather than computed or redefined within the paper. All reported findings (scaling effects, context length, CoT, vision-language variants) are direct measurements under the fixed protocol, with no self-referential reduction of outputs to inputs. The central claim therefore rests on external data and experimental results rather than any internal construction that would qualify as circular under the enumerated patterns.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 2 Pith papers
- NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles. NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...
- NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles. NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.
Reference graph
Works this paper leans on
- [1] Abdin, M., Awan, A. A., Baldassini, L., et al. (2025). Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs. arXiv preprint arXiv:2503.01743.
- [2] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., and Parikh, D. (2015). VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–
- [3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. (2025). Qwen2.5-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2502.13923.
- [4] Chacón-Fernández, P., Sánchez-Campusano, R., Gruart, A., and Delgado-García, J. M. (2016). Long-term treadmill exerci...
- [5] Lu, P., Banerjee, H., et al. (2024). MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. Proceedings of the International Conference on Learning Representations.
- [6] O'Keefe, J. and Dostrovsky, J. (1971). The hippocampus as a spatial map: preliminary evidence from unit activity in the freely-moving rat. Brain Research, 34(1):171–
- [7] Olton, D. S. and Samuelson, R. J. (1976). Remembrance of places passed: Spatial memory in rats. Journal of Experimental Psychology: Animal Behavior Processes, 2(2):97–116.
- [8] Shoji, H., Hagihara, H., Takao, K., Hattori, S., and Miyakawa, T. (2012). T-maze forced alternation and left-right discrimination tasks for assessing working and reference memory in mice...