BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

David Robinson; Ellen Gilsenan-McMahon; Emmanuel Chemla; Gagan Narula; Jiacheng Shen; Marius Miron; Masato Hagiwara; Mathieu Laurière; Matthieu Geist; Milad Alizadeh

arxiv: 2604.16241 · v1 · submitted 2026-04-17 · 💻 cs.CL · cs.AI

Jiacheng Shen, Masato Hagiwara, Milad Alizadeh, Ellen Gilsenan-McMahon, Marius Miron, David Robinson, Emmanuel Chemla, Sara Keen, Gagan Narula, Mathieu Laurière, Matthieu Geist, Olivier Pietquin

Pith reviewed 2026-05-10 07:53 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords BAGEL benchmark · animal knowledge · language models · closed-book evaluation · biodiversity · domain-specific knowledge · taxonomy · species interactions

The pith

BAGEL introduces a closed-book benchmark to test language models on specialized animal knowledge from scientific sources.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BAGEL as a benchmark for evaluating how well language models handle specialized animal-related knowledge without access to external information at test time. It builds the benchmark by combining manually curated examples with automatically generated question-answer pairs drawn from sources including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia. These questions span categories such as taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. The benchmark enables detailed breakdowns of performance by source, taxonomic group, and knowledge type, which helps identify where models succeed or fail in this domain. If successful, it serves as a testbed for understanding how language models generalize knowledge in specific fields like biodiversity and for making them more dependable in practical applications.

Core claim

BAGEL is a benchmark for animal knowledge expertise in language models constructed from diverse scientific and reference sources using curated examples and automatically generated closed-book question-answer pairs. It covers multiple aspects of animal knowledge including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions, and supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories to characterize model strengths and failure modes.
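
The fine-grained analysis described here amounts to grouping graded answers along one axis at a time. The following is a minimal sketch of that idea, not the paper's code: the record fields (`source`, `taxon`, `category`, `correct`) and the sample rows are hypothetical and do not reflect BAGEL's actual schema or results.

```python
from collections import defaultdict

# Hypothetical graded records; field names and values are illustrative only.
records = [
    {"source": "Wikipedia", "taxon": "Aves", "category": "taxonomy", "correct": True},
    {"source": "Wikipedia", "taxon": "Aves", "category": "habitat", "correct": False},
    {"source": "Xeno-canto", "taxon": "Aves", "category": "vocalization", "correct": True},
    {"source": "bioRxiv", "taxon": "Mammalia", "category": "behavior", "correct": False},
    {"source": "Global Biotic Interactions", "taxon": "Insecta",
     "category": "species interactions", "correct": True},
]


def accuracy_by(records, axis):
    """Return {axis value: fraction of correct answers} for one slicing axis."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record[axis]] += 1
        hits[record[axis]] += int(record["correct"])
    return {key: hits[key] / totals[key] for key in totals}


by_source = accuracy_by(records, "source")
by_taxon = accuracy_by(records, "taxon")
by_category = accuracy_by(records, "category")
```

Slicing one axis at a time keeps each cell reasonably populated; crossing all three axes at once would likely leave many cells with too few items to interpret.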

What carries the argument

The BAGEL benchmark itself, which enforces closed-book evaluation by generating questions that test animal knowledge drawn directly from scientific sources without allowing external retrieval during testing.
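
A closed-book protocol of this kind can be sketched as a scoring loop in which the prompt contains only the question and its options, never the source passage or retrieved context. The sketch below assumes a multiple-choice item format and a caller-supplied `answer_fn`; both are assumptions for illustration, as are the two sample items, and none of it is taken from the paper.

```python
def build_closed_book_prompt(question, options):
    """Format a multiple-choice item with no source passage or retrieved context."""
    letters = "ABCDEFGH"
    lines = [question]
    lines += [f"{letters[i]}. {option}" for i, option in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def evaluate_closed_book(items, answer_fn):
    """Score answer_fn (prompt -> letter) on the items; returns accuracy in [0, 1]."""
    if not items:
        return 0.0
    correct = 0
    for item in items:
        prompt = build_closed_book_prompt(item["question"], item["options"])
        prediction = answer_fn(prompt).strip().upper()[:1]
        correct += int(prediction == item["answer"])
    return correct / len(items)


# Hypothetical items, written for this sketch rather than drawn from BAGEL.
items = [
    {"question": "Which class do bats belong to?",
     "options": ["Aves", "Mammalia", "Reptilia", "Amphibia"], "answer": "B"},
    {"question": "Which of these animals is a marsupial?",
     "options": ["Koala", "Otter", "Lemur", "Beaver"], "answer": "A"},
]


def always_a(prompt):
    """Stand-in 'model' that always answers A; swap in a real model call here."""
    return "A"


score = evaluate_closed_book(items, always_a)
```

The constant-answer stand-in doubles as a sanity baseline: a real model should clear it comfortably before its per-category scores are worth reading.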

Load-bearing premise

The curated and automatically generated questions from the selected sources accurately reflect unbiased and comprehensive animal knowledge expertise.

What would settle it

Demonstrating that the benchmark questions can be answered correctly by models using only general knowledge patterns rather than specific animal facts, or that the answers conflict with expert biological consensus.
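
One way to probe the first condition is to replace species names with a generic placeholder and compare accuracy before and after. The sketch below is an assumed design, not a procedure described in the paper: `mask_species`, `shortcut_gap`, the item fields, and the toy lookup "model" are all hypothetical.

```python
import re


def mask_species(text, species_names, placeholder="Species X"):
    """Replace each listed species name (case-insensitive) with a generic placeholder."""
    # Longest names first, so "giant panda" is masked before any shorter overlap.
    for name in sorted(species_names, key=len, reverse=True):
        text = re.sub(re.escape(name), placeholder, text, flags=re.IGNORECASE)
    return text


def shortcut_gap(items, answer_fn):
    """Accuracy on the original stems minus accuracy on the species-masked stems."""
    def accuracy(masked):
        hits = 0
        for item in items:
            stem = item["question"]
            if masked:
                stem = mask_species(stem, item["species"])
            hits += int(answer_fn(stem, item["options"]) == item["answer"])
        return hits / len(items)

    return accuracy(masked=False) - accuracy(masked=True)


# Hypothetical items and a toy "model" that only knows facts keyed on species names.
items = [
    {"question": "What does the koala mainly eat?", "species": ["koala"],
     "options": ["Fish", "Eucalyptus leaves", "Carrion"], "answer": "Eucalyptus leaves"},
    {"question": "What does the giant panda mainly eat?", "species": ["giant panda"],
     "options": ["Bamboo", "Seals", "Krill"], "answer": "Bamboo"},
]
FACTS = {"koala": "Eucalyptus leaves", "giant panda": "Bamboo"}


def toy_model(stem, options):
    """Answer from the lookup table if a known species appears, else guess option 1."""
    for name, fact in FACTS.items():
        if name in stem.lower():
            return fact
    return options[0]


gap = shortcut_gap(items, toy_model)
```

A gap near zero on real items would suggest that the answers are recoverable from wording or general patterns alone; a large gap is consistent with, though not proof of, reliance on species-specific facts.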

Original abstract

Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures animal-related knowledge of models without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

BAGEL introduces a closed-book benchmark for animal knowledge in LLMs but supplies no construction details, validation stats, or model results, so its actual value stays untested.

The main thing to know is that this paper puts forward BAGEL, a new benchmark meant to test how well language models handle specialized animal facts without retrieval. It pulls material from bioRxiv, Global Biotic Interactions, Xeno-canto, Wikipedia, and similar sources, then splits the content into categories like taxonomy, morphology, habitat, behavior, vocalization, distribution, and interactions. The closed-book framing is a reasonable choice if the goal is to measure internalized knowledge rather than search ability. The plan for fine-grained breakdowns by source, taxonomic group, and category could help pinpoint where models fail in this domain. That structure is the clearest positive element here. The paper also correctly notes that general knowledge benchmarks often overlook narrow scientific areas that matter for biodiversity work.

On the downside, the text never explains how the automatic QA pairs were created, what prompts or rules were used, or what human or expert checks were applied. There are no sample questions, no error rates, no inter-annotator numbers, and no baseline scores on any model. Without those pieces the benchmark cannot be evaluated or reproduced, which undercuts the central claim that it supplies a usable testbed.

This work is aimed at people building domain-specific evaluations or teams applying LLMs to conservation and biology. A reader might borrow the category list for their own project, but the paper itself offers little concrete material to use or cite.

It does not yet deserve peer review. The idea is coherent, but the missing pipeline and results make it too preliminary for referee time. Ask for the generation method, a public sample, and at least some model numbers before considering it further.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BAGEL, a benchmark for evaluating animal knowledge expertise in language models under a closed-book protocol. It is constructed from scientific and reference sources including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia via a mix of curated examples and automatically generated QA pairs. The benchmark spans taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions, with support for fine-grained breakdowns by source, taxonomic group, and knowledge category. The stated goal is to provide a testbed for domain-specific knowledge generalization and reliability in biodiversity applications.

Significance. If the QA construction pipeline produces factually accurate, unbiased, and representative items, BAGEL would fill a genuine gap as one of the first unified closed-book benchmarks focused on animal expertise. It could enable targeted diagnosis of LLM failures in specialized scientific domains and support downstream work on reliability for ecology and conservation tasks. The closed-book design and multi-axis slicing are strengths that distinguish it from general knowledge benchmarks.

major comments (2)

[Benchmark construction] Benchmark construction section: The central claim that BAGEL 'accurately and unbiasedly captures animal knowledge expertise' rests on the quality of the automatically generated closed-book QA pairs. The manuscript supplies no description of the generation algorithm (e.g., extraction rules, LLM prompts, or post-processing), no human/expert validation protocol, no inter-annotator agreement or error-rate statistics, and no public sample of items. This absence makes the accuracy/unbiasedness assumption unverifiable and load-bearing for the benchmark's utility.
[Evaluation and results] Evaluation and results sections: No model evaluations, baseline scores, or error analyses are reported. Without at least preliminary closed-book results on representative LLMs (or an explicit statement that the paper is a benchmark-only release), it is impossible to confirm that BAGEL functions as a usable testbed or to characterize the 'systematic failure modes' claimed in the abstract.
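
For the inter-annotator agreement statistic requested in the first major comment, Cohen's kappa for two annotators is one common choice. The sketch below is illustrative only: the paper does not say which statistic, label set, or number of annotators would be used, and the `valid`/`invalid` labels are made up for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used one identical label throughout.
    return (observed - expected) / (1 - expected)


# Made-up validity judgements on six generated QA items.
annotator_1 = ["valid", "valid", "invalid", "valid", "invalid", "valid"]
annotator_2 = ["valid", "invalid", "invalid", "valid", "invalid", "valid"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on five of six items (observed agreement 5/6) against a chance agreement of 0.5, so kappa works out to 2/3. Reporting kappa next to raw agreement shows how much of the agreement exceeds what the label frequencies alone would produce.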

minor comments (2)

[Abstract] The abstract lists seven knowledge categories but does not define their boundaries or report category balance across sources; a short table or appendix clarifying the taxonomy would improve reproducibility.
[Benchmark construction] Source-specific artifacts (e.g., Wikipedia style vs. bioRxiv technical language) are not discussed; adding a brief analysis of potential stylistic or factual biases would strengthen the 'unbiased' claim.
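
The category-balance table suggested in the first minor comment could be produced with a simple cross-tabulation of item counts. This is a sketch under assumed inputs: the `(source, category)` pairs below are invented and say nothing about BAGEL's actual composition.

```python
from collections import Counter

# Invented (source, category) pairs, one per benchmark item.
pairs = [
    ("Wikipedia", "taxonomy"),
    ("Wikipedia", "taxonomy"),
    ("Wikipedia", "habitat"),
    ("Xeno-canto", "vocalization"),
    ("bioRxiv", "behavior"),
]


def balance_table(pairs):
    """Return {source: {category: item count}}, with zero for empty cells."""
    counts = Counter(pairs)
    sources = sorted({source for source, _ in pairs})
    categories = sorted({category for _, category in pairs})
    return {
        source: {category: counts[(source, category)] for category in categories}
        for source in sources
    }


table = balance_table(pairs)
```

Keeping the zero cells explicit makes it visible when a category is drawn from a single source, which is where source-specific style is most likely to be confounded with category difficulty.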

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review of our manuscript on BAGEL. The comments highlight important areas where additional detail and evidence are needed to substantiate the benchmark's claims. We address each major comment below and commit to revisions that will make the construction process verifiable and demonstrate the benchmark's utility through preliminary evaluations.

Referee: [Benchmark construction] Benchmark construction section: The central claim that BAGEL 'accurately and unbiasedly captures animal knowledge expertise' rests on the quality of the automatically generated closed-book QA pairs. The manuscript supplies no description of the generation algorithm (e.g., extraction rules, LLM prompts, or post-processing), no human/expert validation protocol, no inter-annotator agreement or error-rate statistics, and no public sample of items. This absence makes the accuracy/unbiasedness assumption unverifiable and load-bearing for the benchmark's utility.

Authors: We agree that the current manuscript does not provide sufficient detail on the QA generation pipeline, validation, or supporting statistics, which limits the ability to verify the benchmark's quality. This was an omission in the initial draft. In the revised version, we will expand the Benchmark Construction section to include: (1) a full description of the generation algorithm with extraction rules, LLM prompts, and post-processing steps; (2) the human/expert validation protocol; (3) inter-annotator agreement and error-rate statistics; and (4) a public sample of benchmark items (with the full dataset to be released upon acceptance). These additions will make the accuracy and unbiasedness claims verifiable.

Revision: yes

Referee: [Evaluation and results] Evaluation and results sections: No model evaluations, baseline scores, or error analyses are reported. Without at least preliminary closed-book results on representative LLMs (or an explicit statement that the paper is a benchmark-only release), it is impossible to confirm that BAGEL functions as a usable testbed or to characterize the 'systematic failure modes' claimed in the abstract.

Authors: The manuscript's primary contribution is the introduction of the benchmark itself, with the abstract's reference to characterizing failure modes being prospective based on the benchmark's design. However, we acknowledge that the absence of any evaluations makes it difficult to confirm usability. In the revised manuscript, we will add a new Evaluation section with preliminary closed-book results on a set of representative LLMs (e.g., GPT-4, Llama-3, and smaller models), including baseline accuracy scores across categories and an initial error analysis. This will demonstrate that BAGEL serves as a functional testbed and provide concrete examples of the systematic failure modes it can reveal.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper presents BAGEL as a benchmark dataset constructed from external sources (bioRxiv, Global Biotic Interactions, Xeno-canto, Wikipedia) via curated examples and automated closed-book QA generation. No equations, parameter fits, predictions, or derivations are present that could reduce to the inputs by construction. The work is an evaluation resource rather than a result derived from prior claims or self-referential steps, making the construction self-contained without circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard NLP benchmark construction practices and the assumption that closed-book QA measures internalized knowledge; no free parameters, new entities, or non-standard axioms are introduced.

axioms (1)

Domain assumption: Closed-book question answering from curated scientific sources accurately measures a model's internalized animal knowledge without retrieval.
This assumption underpins the entire evaluation protocol described in the abstract.
