CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

Zacharie Bugaud

arxiv: 2604.10825 · v2 · pith:TRUGLNEYnew · submitted 2026-04-12 · 💻 cs.AI

Pith reviewed 2026-05-21 00:47 UTC · model grok-4.3

classification 💻 cs.AI

keywords CheeseBench · LLM evaluation · rodent behavior · behavioral neuroscience · zero-shot learning · spatial navigation · ASCII environments · benchmark

The pith

The best open-weight LLM reaches only 53 percent average success on rodent behavioral tasks in a shared zero-shot ASCII protocol, well below the approximate animal baseline of 79 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CheeseBench to test large language models on nine established rodent neuroscience paradigms including various mazes and conditioning chambers. Models operate under a single system prompt with no task-specific guidance and must infer goals solely from ASCII text observations and reward signals. The best evaluated model achieves 52.6 percent average success, exceeding random performance at 32.1 percent but remaining substantially below approximate rodent reference values at 78.9 percent. Results indicate that model scale beyond 7 billion parameters brings little gain, while longer context and chain-of-thought prompting both reduce scores. Performance gaps are largest on tasks that require spatial navigation and tracking of within-trial state.

Core claim

Under a unified zero-shot ASCII protocol with no task-specific instructions, six open-weight LLMs from 3B to 72B parameters average well below approximate rodent baselines across nine classical behavioral neuroscience paradigms. The strongest model reaches 52.6 percent success compared with 78.9 percent for rodents and 32.1 percent for random agents. The same model’s score can swing from 20 to 57 percent depending only on interface parameters such as context length and prompting style, showing that the agent-plus-interface system rather than the isolated model is being measured.
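One way to read the three averages together is to ask what fraction of the random-to-rodent gap the best model closes. The sketch below is a reading aid only: the paper reports the three averages, not this normalization, and the function name `gap_closed` is an assumption introduced here.

```python
def gap_closed(model: float, random_baseline: float, reference: float) -> float:
    """Fraction of the random-to-reference gap that the model closes.

    Assumed normalization (not taken from the paper): 0.0 means the model
    matches the random agent, 1.0 means it matches the reference.
    """
    span = reference - random_baseline
    if span <= 0:
        raise ValueError("reference must exceed the random baseline")
    return (model - random_baseline) / span


# Average success rates quoted above, in percent.
best_model, random_agent, rodent = 52.6, 32.1, 78.9

fraction = gap_closed(best_model, random_agent, rodent)
print(f"{fraction:.3f}")  # (52.6 - 32.1) / (78.9 - 32.1) = 20.5 / 46.8
```

Under this assumed normalization the best model covers a bit less than half of the distance between random play and the approximate rodent reference. Because the rodent values are themselves approximate, the ratio should be treated as indicative rather than exact.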

What carries the argument

The unified zero-shot ASCII protocol in which agents receive only a generic system prompt and must discover task goals from text observations and scalar rewards without any task-specific instructions.
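The interaction pattern described here (text observation in, action out, scalar reward back) can be sketched with a toy environment. Everything below is hypothetical: `ToyCorridor`, its symbols, and the stand-in policies are illustrations and not CheeseBench's environments or API. In the paper's setting the policy would be an LLM call behind the generic system prompt.

```python
import random
from typing import Callable, Tuple


class ToyCorridor:
    """Hypothetical 1-D corridor: agent 'A' must reach goal 'G' at the right end."""

    def __init__(self, length: int = 7, max_steps: int = 20) -> None:
        self.length = length
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def reset(self) -> str:
        self.pos = 0
        self.steps = 0
        return self.render()

    def render(self) -> str:
        # ASCII observation: walls '#', floor '.', goal 'G', agent 'A'.
        cells = ["."] * self.length
        cells[-1] = "G"
        cells[self.pos] = "A"
        return "#" + "".join(cells) + "#"

    def step(self, action: str) -> Tuple[str, float, bool]:
        if action == "right":
            self.pos = min(self.pos + 1, self.length - 1)
        elif action == "left":
            self.pos = max(self.pos - 1, 0)
        self.steps += 1
        reached = self.pos == self.length - 1
        done = reached or self.steps >= self.max_steps
        # The only feedback is a scalar reward; no task description is given.
        return self.render(), (1.0 if reached else 0.0), done


Policy = Callable[[str, float], str]


def run_episode(env: ToyCorridor, policy: Policy) -> bool:
    """Run one episode and report whether the goal was reached."""
    obs, reward, done = env.reset(), 0.0, False
    while not done:
        action = policy(obs, reward)
        obs, reward, done = env.step(action)
    return reward > 0.0


def success_rate(policy: Policy, episodes: int = 100) -> float:
    return sum(run_episode(ToyCorridor(), policy) for _ in range(episodes)) / episodes


def always_right(obs: str, reward: float) -> str:
    return "right"


_rng = random.Random(0)


def random_policy(obs: str, reward: float) -> str:
    return _rng.choice(["left", "right"])


print(ToyCorridor().reset())
print("always_right:", success_rate(always_right))
print("random:", success_rate(random_policy))
```

The point of the sketch is the shape of the loop: the policy never sees a statement of the goal, so any success has to come from exploring and reacting to reward.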

If this is right

Scaling model size past 7 billion parameters yields diminishing or no returns on these spatial and memory tasks.
Increasing context history length tends to lower rather than raise success rates.
Chain-of-thought prompting reduces average performance compared with direct output.
Vision-language architectures improve results at 7B scale but degrade them at 32B scale.
Spatial navigation and within-trial state tracking remain the clearest areas of underperformance relative to rodent references.
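Two of the findings above concern interface parameters: how much history is fed back to the model and whether it is asked to reason before acting. A minimal sketch of how such knobs might enter prompt construction follows. The prompt wording, the `build_prompt` function, and its parameters are assumptions for illustration; the paper's actual prompt and history format are not reproduced here.

```python
from typing import List, Tuple

# Placeholder text, not the system prompt used in the paper.
SYSTEM_PROMPT = "You are an agent in an unfamiliar environment. Choose one action per turn."

Turn = Tuple[str, str, float]  # (observation, action, reward)


def build_prompt(
    history: List[Turn],
    observation: str,
    history_turns: int,
    chain_of_thought: bool,
) -> str:
    """Assemble a prompt from the last `history_turns` turns plus the current observation."""
    if history_turns < 0:
        raise ValueError("history_turns must be non-negative")
    # history[-0:] would return the full list, so zero is handled explicitly.
    recent = history[-history_turns:] if history_turns > 0 else []
    lines = [SYSTEM_PROMPT]
    for past_obs, past_action, past_reward in recent:
        lines.append(f"Observation:\n{past_obs}")
        lines.append(f"Action: {past_action}")
        lines.append(f"Reward: {past_reward}")
    lines.append(f"Observation:\n{observation}")
    if chain_of_thought:
        lines.append("Think step by step, then give the action.")
    else:
        lines.append("Reply with the action only.")
    return "\n".join(lines)


history = [
    ("obs-1", "left", 0.0),
    ("obs-2", "right", 0.0),
    ("obs-3", "right", 1.0),
]
print(build_prompt(history, "obs-now", history_turns=1, chain_of_thought=False))
```

Sweeping `history_turns` and `chain_of_thought` while holding the model fixed is the kind of ablation that would expose the interface sensitivity the paper reports.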

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark isolates interface effects from raw model capacity, suggesting future work should standardize or ablate rendering choices before claiming intrinsic limits.
If performance improves on these abstracted tasks, the gains could indicate better internal world models that transfer to planning or robotics domains.
Task-by-task error patterns might map onto specific deficits such as mental rotation or working memory that parallel known computational neuroscience accounts.
Extending the protocol to include richer sensory channels could test whether current gaps stem from text poverty rather than reasoning shortfalls.

Load-bearing premise

ASCII text renderings and reward signals supply a fair proxy for the cognitive demands of the physical rodent paradigms.

What would settle it

Run the identical LLM agent on a pixel- or physics-based simulation of the same mazes and chambers and measure whether the performance gap to rodent baselines shrinks or stays the same.

Figures

Figures reproduced from arXiv: 2604.10825 by Zacharie Bugaud.

**Figure 1.** ASCII renderings of all nine CheeseBench en…

**Figure 3.** Per-environment success rates (ASCII text input)

**Figure 4.** Cognitive profile across six dimensions. The LLM…

**Figure 5.** Scaling behavior: performance saturates beyond…

Original abstract

We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task-specific instructions and must discover goals purely from ASCII text observations and reward signals, much like a rodent placed into an unfamiliar apparatus. We evaluate six open-weight LLMs (3B to 72B parameters) on text-based ASCII renderings and compare against both a random baseline and a graph-based reinforcement learning agent. Our best model (Qwen2.5-VL-7B) reaches 52.6% average success on ASCII input, compared to 32.1% for random agents and 78.9% for approximate rodent baselines. We find that (1) scaling beyond 7B yields diminishing returns, (2) longer context history degrades performance, (3) chain-of-thought prompting hurts rather than helps, and (4) a vision-language architecture provides an advantage at 7B but hurts at 32B. Because the same model's performance ranges from 20% to 57% depending on interface parameters alone, these results characterize the agent-plus-interface system, not the model in isolation. Under this unified zero-shot ASCII protocol, current open-weight LLM agents remain well below approximate rodent reference values, particularly on tasks requiring spatial navigation and within-trial state tracking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CheeseBench is a straightforward new benchmark that tests LLMs on rodent tasks via ASCII text and shows clear interface sensitivity, but the rodent comparison rests on how well the text versions match the original cognitive demands.

CheeseBench puts six open-weight LLMs through nine standard rodent paradigms rendered as ASCII text with reward signals only. The best result is 52.6% average success against roughly 79% for the animal baselines and 32% for random. The paper also reports that the same model can swing from 20% to 57% just by changing interface parameters such as context length and prompting style, and that scaling past 7B brings little gain while chain-of-thought hurts performance here. That interface dependence is the most useful concrete finding because it frames the results as properties of the agent-plus-input system rather than the model in isolation.

The unified zero-shot protocol across tasks like the Morris water maze, radial arm maze, and delayed non-match to sample is new and cleanly executed, with direct comparisons to both a random baseline and a graph-based RL agent. The work stays empirical and avoids fitting parameters or self-referential claims.

The main soft spot is whether the ASCII renderings actually impose the same spatial and state-tracking demands as the physical setups. Rodents rely on continuous cues and path integration; an LLM may be able to succeed by parsing discrete symbols or grids directly. The paper already flags the large interface effects, which is honest, but it does not quantify how much the chosen text format changes task difficulty relative to the original sensory statistics. If the text version differs in difficulty along some dimensions, the reported gap to rodent baselines cannot be read as a pure model limitation.

This is useful for anyone building or evaluating LLM agents that need to handle navigation or memory from limited observations. The methods are transparent enough and the results are falsifiable, so the paper deserves a serious referee who can check the exact rendering details and success criteria. I would send it out for review with a request that the author clarify how the ASCII formats were chosen and whether alternative renderings were tested.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CheeseBench, a benchmark evaluating six open-weight LLMs (3B–72B) on nine rodent behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, delayed non-match to sample) rendered as ASCII text. Agents receive a single unified zero-shot prompt and must infer goals from text observations and reward signals. Results show the best model (Qwen2.5-VL-7B) at 52.6% average success versus 32.1% random and 78.9% approximate rodent baselines, with additional findings that scaling beyond 7B yields diminishing returns, longer context degrades performance, chain-of-thought hurts, and vision-language architectures show mixed effects. The same model's success rate ranges from 20% to 57% depending on interface parameters alone.

Significance. If the ASCII proxy adequately captures the targeted cognitive dimensions, the work supplies a reproducible, unified protocol for testing LLM spatial navigation and state-tracking abilities against external animal baselines. The explicit documentation of interface sensitivity is a strength, as is the direct empirical measurement against both random and rodent references without fitted parameters or circular derivations. This contributes concrete data on where current open-weight models fall short relative to biological systems on these tasks.

major comments (2)

[Task Design and Rendering] The central claim that LLMs remain below rodent baselines on navigation and state-tracking tasks rests on the assumption that the chosen ASCII renderings impose equivalent demands to the physical paradigms. For Morris water maze and radial arm maze, rodents integrate continuous distal cues and path integration; the discrete grid or symbolic text may permit explicit deduction or pattern matching instead. The manuscript notes 20–57% variation with interface parameters but does not quantify how far the selected ASCII format deviates from the sensory statistics that render the original tasks difficult. This directly affects whether the 52.6% vs 78.9% gap can be attributed primarily to model capability.
[Abstract and Results] Abstract and §4 (Results): Success rates are presented without error bars, exact trial-by-trial success criteria, or details on how ASCII observations and reward signals are procedurally generated for each paradigm. These omissions leave the reported averages only partially verifiable and weaken assessment of whether the gap to rodent baselines is statistically robust.

minor comments (2)

[Results] Table 1 or equivalent results table: Include standard deviations or confidence intervals alongside mean success rates to allow readers to gauge variability across runs or seeds.
[Figures] Figure captions for ASCII examples: Add explicit scale or grid size information so readers can assess information density relative to the physical apparatus dimensions described in the rodent protocols.
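The request for error bars can be made concrete. For a success rate estimated from a fixed number of independent trials, a Wilson score interval is one standard choice. The sketch below assumes binary per-trial outcomes; the trial counts in the example are made up for illustration, since the per-task trial counts are not given in this review.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical counts for illustration only: 53 successes in 100 trials.
low, high = wilson_interval(53, 100)
print(f"{low:.3f} to {high:.3f}")
```

With only a few dozen trials per task, intervals of this kind can be wide enough to matter when comparing models whose averages differ by a few points, which is why the referee asks for them.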

Circularity Check

0 steps flagged

No significant circularity

The paper introduces CheeseBench as an empirical benchmark and reports direct performance measurements of LLMs on ASCII-rendered rodent tasks, compared against random baselines, a graph-based RL agent, and approximate rodent reference values drawn from peer-reviewed literature. No derivations, equations, fitted parameters, or first-principles predictions are present; success rates are measured outcomes rather than quantities defined or forced by the paper's own constructs. The central claim (LLMs below rodent baselines under the unified zero-shot ASCII protocol) rests on external animal data and observed model outputs, not on self-referential reductions or self-citation chains that would make the result tautological. Interface sensitivity is explicitly noted as a finding rather than hidden or redefined. This is a standard empirical evaluation with no load-bearing steps that collapse to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

This is an empirical benchmark paper; the central claims rest on the validity of task approximations rather than mathematical derivations or new theoretical entities.

axioms (2)

domain assumption: ASCII text renderings and reward signals sufficiently proxy the cognitive dimensions of the physical rodent paradigms.
Invoked in the description of the agent receiving observations and rewards without task-specific instructions.
domain assumption: Approximate rodent baselines provide a meaningful reference for LLM performance.
Used for the 78.9% comparison without new animal data collection.

pith-pipeline@v0.9.0 · 5823 in / 1450 out tokens · 46698 ms · 2026-05-21T00:47:18.118551+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
cs.AI 2026-05 accept novelty 7.0

NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
cs.AI 2026-05 unverdicted novelty 6.0

NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Abdin, M., Awan, A. A., Baldassini, L., et al. (2025). Phi-4-multimodal technical report. arXiv preprint arXiv:2503.01743.
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., and Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. (2025). Qwen2.5-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2502.13923.
Chacón-Fernández, P., Sánchez-Campusano, R., Gruart, A., and Delgado-García, J. M. (2016). Long-term treadmill exerci...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Lu, P., Banerjee, H., et al. (2024). MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. Proceedings of the International Conference on Learning Representations.
O’Keefe, J. and Dostrovsky, J. (1971). The hippocampus as a spatial map: preliminary evidence from unit activity in the freely-moving rat. Brain Research, 34(1):171–

work page 2024
[4]

Olton, D. S. and Samuelson, R. J. (1976). Remembrance of places passed: Spatial memory in rats. Journal of Experimental Psychology: Animal Behavior Processes, 2(2):97–116.
Shoji, H., Hagihara, H., Takao, K., Hattori, S., and Miyakawa, T. (2012). T-maze forced alternation and left-right discrimination tasks for assessing working and reference memory in mice...

work page 1976
