CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms
Pith reviewed 2026-05-10 15:17 UTC · model grok-4.3
The pith
Open-weight LLMs reach only about 53 percent average success on ASCII versions of classic rodent tasks, while the animals themselves average roughly 79 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CheeseBench evaluates LLMs on nine peer-reviewed rodent paradigms rendered as unified zero-shot ASCII text environments. The best evaluated model, Qwen2.5-VL-7B, reaches 52.6 percent average success, against approximate rodent baselines of 78.9 percent and a random-agent baseline of 32.1 percent. Performance is weakest on tasks requiring spatial navigation and within-trial state tracking; scaling beyond 7B parameters yields diminishing returns, longer context history degrades performance, and chain-of-thought prompting hurts rather than helps. A vision-language architecture helps at 7B but hurts at 32B. Because interface parameters alone move the same model between 20 and 57 percent, the results characterize the full agent-plus-interface system rather than isolated model capability.
What carries the argument
CheeseBench, a unified zero-shot ASCII protocol that presents rodent tasks as text observations and reward signals without any task-specific instructions.
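This protocol can be sketched as a minimal observation-and-reward loop. The environment below (`TMazeEnv`, its `render`/`step` methods, and the `{'L', 'R'}` action set) is an illustrative assumption, not CheeseBench's actual API; the point is only the interface: the agent sees an ASCII grid plus a scalar reward and receives no task-specific instructions.

```python
import random

# Minimal sketch of a zero-shot ASCII protocol. TMazeEnv and its action set
# are illustrative assumptions, not CheeseBench's actual implementation.

class TMazeEnv:
    """Toy forced-alternation T-maze: the rewarded arm flips every trial."""

    def __init__(self, seed=0):
        self.rewarded_arm = random.Random(seed).choice(["L", "R"])

    def render(self):
        # The only observation the agent ever sees: an ASCII grid with the
        # agent '@' at the choice point and arms 'L' / 'R'.
        return ("#########\n"
                "#L  @  R#\n"
                "#### ####")

    def step(self, action):
        """action in {'L', 'R'}; returns (next observation, reward)."""
        reward = 1.0 if action == self.rewarded_arm else 0.0
        # Forced alternation: the other arm is rewarded on the next trial.
        self.rewarded_arm = "L" if self.rewarded_arm == "R" else "R"
        return self.render(), reward

def random_agent(observation):
    # Stand-in for the LLM: maps an ASCII observation to an action.
    return random.choice(["L", "R"])

if __name__ == "__main__":
    env = TMazeEnv()
    trials, successes = 1000, 0.0
    for _ in range(trials):
        obs = env.render()
        obs, reward = env.step(random_agent(obs))
        successes += reward
    print(f"random-agent success rate: {successes / trials:.2f}")  # near 0.50
```

An LLM agent would replace `random_agent`, receiving the rendered grid and reward history in its prompt; everything it knows about the task must be inferred from this loop, which is why the paper reads its scores as properties of the agent-plus-interface system.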
Load-bearing premise
The ASCII text renderings of the tasks accurately capture the core cognitive and perceptual demands of the original rodent behavioral paradigms without introducing artifacts that alter difficulty.
What would settle it
An open-weight LLM achieving approximately 79 percent average success across the nine ASCII tasks under the same zero-shot protocol would directly contradict the central claim.
Original abstract
We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task-specific instructions and must discover goals purely from ASCII text observations and reward signals, much like a rodent placed into an unfamiliar apparatus. We evaluate six open-weight LLMs (3B to 72B parameters) on text-based ASCII renderings and compare against both a random baseline and a graph-based reinforcement learning agent. Our best model (Qwen2.5-VL-7B) reaches 52.6% average success on ASCII input, compared to 32.1% for random agents and 78.9% for approximate rodent baselines. We find that (1) scaling beyond 7B yields diminishing returns, (2) longer context history degrades performance, (3) chain-of-thought prompting hurts rather than helps, and (4) a vision-language architecture provides an advantage at 7B but hurts at 32B. Because the same model's performance ranges from 20% to 57% depending on interface parameters alone, these results characterize the agent-plus-interface system, not the model in isolation. Under this unified zero-shot ASCII protocol, current open-weight LLM agents remain well below approximate rodent reference values, particularly on tasks requiring spatial navigation and within-trial state tracking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CheeseBench, a benchmark evaluating six open-weight LLMs (3B–72B) on nine rodent behavioral neuroscience tasks (Morris water maze, Barnes maze, T-maze, etc.) rendered as unified zero-shot ASCII text environments. Agents receive only a general system prompt and must infer goals from observations and reward signals. The best model (Qwen2.5-VL-7B) achieves 52.6% average success versus 32.1% random and 78.9% approximate rodent baselines. Additional findings include diminishing returns beyond 7B, negative effects from longer context and chain-of-thought, and interface sensitivity; the central conclusion is that current LLM agents remain well below rodent levels, especially on spatial navigation and within-trial state tracking.
Significance. If the rodent baselines prove comparable and the ASCII renderings preserve core demands, the benchmark offers a reproducible, multi-task protocol for probing LLM cognitive capabilities against animal references, with explicit controls for random and graph-RL agents. The empirical scaling, prompting, and architecture ablations are useful for the field. The work's value is reduced by the lack of transparent derivation for the 78.9% rodent figures and limited validation that text interfaces equate to physical task difficulty.
major comments (2)
- [Abstract] Abstract: The central claim that LLMs 'remain well below approximate rodent reference values' (78.9% average) is load-bearing for the headline result and interpretation of the performance gap. However, no equation, table, or section details how these rodent percentages were obtained (first-trial naïve performance, multi-session averages, trial counts, or adjustments for loss of olfactory/visual cues in the ASCII zero-shot setting). This leaves the gap sensitive to unstated assumptions about baseline equivalence.
- [Task implementation] Task implementation section: The assumption that ASCII renderings accurately capture the cognitive and perceptual demands of the original paradigms (e.g., spatial navigation in Morris water maze or state tracking in delayed non-match to sample) without interface artifacts is central to attributing results to model limitations rather than protocol mismatch. No explicit validation, mapping details, or sensitivity analysis to rendering choices is provided to support this equivalence.
minor comments (2)
- [Results] Results: The statement that performance 'ranges from 20% to 57% depending on interface parameters alone' would be strengthened by a table listing the exact parameters varied and the corresponding scores for the best model.
- [Abstract] Abstract: Clarify the precise parameter counts and names of all six evaluated models to allow direct replication of the scaling and architecture comparisons.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the major comments point by point below, indicating where revisions will be made to improve transparency.
Point-by-point responses
Referee: [Abstract] Abstract: The central claim that LLMs 'remain well below approximate rodent reference values' (78.9% average) is load-bearing for the headline result and interpretation of the performance gap. However, no equation, table, or section details how these rodent percentages were obtained (first-trial naïve performance, multi-session averages, trial counts, or adjustments for loss of olfactory/visual cues in the ASCII zero-shot setting). This leaves the gap sensitive to unstated assumptions about baseline equivalence.
Authors: We agree that greater transparency on the rodent baselines is needed. These approximate values (averaging 78.9%) were compiled from peer-reviewed rodent studies for each of the nine tasks, prioritizing naïve or early-trial performance metrics to best align with the zero-shot protocol. Adjustments for missing olfactory and visual cues were implicit in selecting text-compatible metrics rather than explicitly quantified. In revision we will add an appendix or dedicated subsection listing the source papers, specific performance figures, trial conditions, and any assumptions for each task, allowing readers to evaluate the comparison directly. revision: yes
Referee: [Task implementation] Task implementation section: The assumption that ASCII renderings accurately capture the cognitive and perceptual demands of the original paradigms (e.g., spatial navigation in Morris water maze or state tracking in delayed non-match to sample) without interface artifacts is central to attributing results to model limitations rather than protocol mismatch. No explicit validation, mapping details, or sensitivity analysis to rendering choices is provided to support this equivalence.
Authors: We acknowledge the importance of clarifying the mapping between physical paradigms and ASCII environments. The manuscript already reports that the same model’s success rate varies from 20% to 57% solely due to interface parameter changes, which supports interpreting results as characterizing the agent-plus-interface system rather than isolated model capability. In the revised version we will expand the Task Implementation section with explicit rendering details and mappings for each task (e.g., how spatial layouts and state information are encoded in ASCII), citing the original rodent protocols. While a full empirical equivalence validation would require new physical experiments outside the current scope, we will add discussion of potential artifacts and their implications for the observed LLM–rodent gap. revision: partial
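As a concrete illustration of the kind of rendering mapping the response promises, the sketch below shows one hypothetical way interface parameters alone (glyph choice, coordinate labels) change the surface form of an identical T-maze state. The `render` function and its parameters are assumptions for illustration, not the paper's actual renderer.

```python
# Hypothetical renderer: the underlying task state (wall layout, agent and
# arm positions) is fixed; only interface parameters vary.
def render(grid, wall_char="#", agent_char="@", with_coords=False):
    rows = []
    for r, row in enumerate(grid):
        line = "".join(
            wall_char if c == "#" else agent_char if c == "@" else c
            for c in row
        )
        if with_coords:
            line = f"{r} {line}"  # prepend a row index as an extra cue
        rows.append(line)
    return "\n".join(rows)

# One T-maze state: agent '@' at the choice point, arms 'L' and 'R'.
tmaze = ["#########",
         "#L  @  R#",
         "#### ####"]

print(render(tmaze))                                            # default interface
print(render(tmaze, wall_char="+", agent_char="x", with_coords=True))
```

Both renderings encode the same state, yet the model's tokenizer and attention receive different inputs; sensitivity of success rates to exactly such choices is what motivates reading the results as characterizing the agent-plus-interface system.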
Circularity Check
No circularity: purely empirical benchmark with external literature baselines
Full rationale
The paper introduces CheeseBench as an empirical evaluation of LLMs on ASCII-rendered rodent tasks, reporting observed success rates (e.g., 52.6% for Qwen2.5-VL-7B) against random baselines and approximate rodent values drawn from peer-reviewed protocols. No equations, fitted parameters, predictions, or derivations appear in the provided text. The rodent baselines are explicitly labeled 'approximate' and sourced externally rather than computed or redefined within the paper. All reported findings (scaling effects, context length, CoT, vision-language variants) are direct measurements under the fixed protocol, with no self-referential reduction of outputs to inputs. The central claim therefore rests on external data and experimental results rather than any internal construction that would qualify as circular under the enumerated patterns.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 2 Pith papers
- NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles. NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...
- NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles. NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.
Reference graph
Works this paper leans on
- [1] Abdin, M., Awan, A. A., Baldassini, L., et al. (2025). Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs. arXiv preprint arXiv:2503.01743.
- [2] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., and Parikh, D. (2015). VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–
- [3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. (2025). Qwen2.5-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2502.13923.
- [4] Chacón-Fernández, P., Sánchez-Campusano, R., Gruart, A., and Delgado-García, J. M. (2016). Long-term treadmill exerci...
- [5] Lu, P., Banerjee, H., et al. (2024). MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. Proceedings of the International Conference on Learning Representations.
- [6] O'Keefe, J. and Dostrovsky, J. (1971). The hippocampus as a spatial map: preliminary evidence from unit activity in the freely-moving rat. Brain Research, 34(1):171–
- [7] Olton, D. S. and Samuelson, R. J. (1976). Remembrance of places passed: Spatial memory in rats. Journal of Experimental Psychology: Animal Behavior Processes, 2(2):97–116.
- [8] Shoji, H., Hagihara, H., Takao, K., Hattori, S., and Miyakawa, T. (2012). T-maze forced alternation and left-right discrimination tasks for assessing working and reference memory in mice...