BAGEL: Benchmarking Animal Knowledge Expertise in Language Models
Pith reviewed 2026-05-10 07:53 UTC · model grok-4.3
The pith
BAGEL introduces a closed-book benchmark to test language models on specialized animal knowledge from scientific sources.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BAGEL is a benchmark for animal knowledge expertise in language models constructed from diverse scientific and reference sources using curated examples and automatically generated closed-book question-answer pairs. It covers multiple aspects of animal knowledge including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions, and supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories to characterize model strengths and failure modes.
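The multi-axis breakdown described above can be pictured with a small sketch. It assumes each scored item is a dictionary carrying `source`, `taxon`, `category`, and a boolean `correct`; these field names and the sample records are hypothetical, since the paper's data schema is not given here.

```python
from collections import defaultdict

def accuracy_by(records, axis):
    """Return {axis value: accuracy} for a list of scored items."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        key = rec[axis]
        totals[key] += 1
        correct[key] += int(rec["correct"])
    return {key: correct[key] / totals[key] for key in totals}

# Hypothetical scored items; field names and values are illustrative only.
records = [
    {"source": "Wikipedia", "taxon": "Aves", "category": "habitat", "correct": True},
    {"source": "Xeno-canto", "taxon": "Aves", "category": "vocalization", "correct": False},
    {"source": "bioRxiv", "taxon": "Mammalia", "category": "behavior", "correct": True},
    {"source": "Wikipedia", "taxon": "Mammalia", "category": "habitat", "correct": False},
]

by_source = accuracy_by(records, "source")      # slice by source domain
by_taxon = accuracy_by(records, "taxon")        # slice by taxonomic group
by_category = accuracy_by(records, "category")  # slice by knowledge category
```

Keeping the slicing axis as a parameter means the same function serves all three breakdowns the benchmark advertises.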
What carries the argument
The BAGEL benchmark itself, which enforces closed-book evaluation by generating questions that test animal knowledge drawn directly from scientific sources without allowing external retrieval during testing.
Load-bearing premise
The curated and automatically generated questions from the selected sources accurately reflect unbiased and comprehensive animal knowledge expertise.
What would settle it
Demonstrating that the benchmark questions can be answered correctly by models using only general knowledge patterns rather than specific animal facts, or that the answers conflict with expert biological consensus.
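One way to probe the first condition is a masking ablation: replace species names with generic placeholders and see whether accuracy drops. The sketch below is an assumption about how such a check might look, not a procedure from the paper. The item fields (`stem`, `species`, `options`, `answer`) and the `answer_fn` callable standing in for a model are hypothetical.

```python
import re

def mask_species(text, species_names, placeholder="SPECIES"):
    """Replace each listed species name with a numbered generic placeholder."""
    masked = text
    # Replace longer names first so a short name cannot clobber part of a longer one.
    order = sorted(range(len(species_names)),
                   key=lambda i: len(species_names[i]), reverse=True)
    for i in order:
        masked = re.sub(re.escape(species_names[i]), f"{placeholder}_{i + 1}",
                        masked, flags=re.IGNORECASE)
    return masked

def shortcut_gap(items, answer_fn):
    """Accuracy on original stems minus accuracy on species-masked stems."""
    def acc(stems):
        hits = sum(answer_fn(stem, item["options"]) == item["answer"]
                   for stem, item in zip(stems, items))
        return hits / len(items)
    original = [item["stem"] for item in items]
    masked = [mask_species(item["stem"], item["species"]) for item in items]
    return acc(original) - acc(masked)

# Toy stand-in for a model: it only answers correctly when it sees the species name.
def toy_answer_fn(stem, options):
    return options[0] if "barn owl" in stem.lower() else options[1]

items = [{
    "stem": "Which prey does the barn owl most often take?",
    "species": ["barn owl"],
    "options": ["small rodents", "large fish"],
    "answer": "small rodents",
}]
gap = shortcut_gap(items, toy_answer_fn)
```

A gap near zero on real items would suggest that answers do not depend on species-specific facts, which is the failure the settling condition describes; a large gap points the other way.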
Original abstract
Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures animal-related knowledge of models without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces BAGEL, a benchmark for evaluating animal knowledge expertise in language models under a closed-book protocol. It is constructed from scientific and reference sources including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia via a mix of curated examples and automatically generated QA pairs. The benchmark spans taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions, with support for fine-grained breakdowns by source, taxonomic group, and knowledge category. The stated goal is to provide a testbed for domain-specific knowledge generalization and reliability in biodiversity applications.
Significance. If the QA construction pipeline produces factually accurate, unbiased, and representative items, BAGEL would fill a genuine gap as one of the first unified closed-book benchmarks focused on animal expertise. It could enable targeted diagnosis of LLM failures in specialized scientific domains and support downstream work on reliability for ecology and conservation tasks. The closed-book design and multi-axis slicing are strengths that distinguish it from general knowledge benchmarks.
major comments (2)
- [Benchmark construction] Benchmark construction section: The central claim that BAGEL 'accurately and unbiasedly captures animal knowledge expertise' rests on the quality of the automatically generated closed-book QA pairs. The manuscript supplies no description of the generation algorithm (e.g., extraction rules, LLM prompts, or post-processing), no human/expert validation protocol, no inter-annotator agreement or error-rate statistics, and no public sample of items. This absence makes the accuracy/unbiasedness assumption unverifiable and load-bearing for the benchmark's utility.
- [Evaluation and results] Evaluation and results sections: No model evaluations, baseline scores, or error analyses are reported. Without at least preliminary closed-book results on representative LLMs (or an explicit statement that the paper is a benchmark-only release), it is impossible to confirm that BAGEL functions as a usable testbed or to characterize the 'systematic failure modes' claimed in the abstract.
minor comments (2)
- [Abstract] The abstract lists seven knowledge categories but does not define their boundaries or report category balance across sources; a short table or appendix clarifying the taxonomy would improve reproducibility.
- [Benchmark construction] Source-specific artifacts (e.g., Wikipedia style vs. bioRxiv technical language) are not discussed; adding a brief analysis of potential stylistic or factual biases would strengthen the 'unbiased' claim.
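The category-balance table requested in the first minor comment could be produced along these lines. This is a minimal sketch that assumes each item records a `source` and a `category`; both field names and the sample items are hypothetical.

```python
from collections import Counter

def category_balance(items):
    """Count items per (source, category) cell, filling empty cells with 0."""
    counts = Counter((item["source"], item["category"]) for item in items)
    sources = sorted({source for source, _ in counts})
    categories = sorted({category for _, category in counts})
    return {s: {c: counts[(s, c)] for c in categories} for s in sources}

# Hypothetical items, for illustration only.
items = [
    {"source": "Wikipedia", "category": "habitat"},
    {"source": "Wikipedia", "category": "taxonomy"},
    {"source": "Xeno-canto", "category": "vocalization"},
]
table = category_balance(items)
```

Empty cells are reported as explicit zeros so that gaps in coverage (for example, a source contributing nothing to a category) are visible rather than silently missing.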
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review of our manuscript on BAGEL. The comments highlight important areas where additional detail and evidence are needed to substantiate the benchmark's claims. We address each major comment below and commit to revisions that will make the construction process verifiable and demonstrate the benchmark's utility through preliminary evaluations.
Point-by-point responses
- Referee: [Benchmark construction] Benchmark construction section: The central claim that BAGEL 'accurately and unbiasedly captures animal knowledge expertise' rests on the quality of the automatically generated closed-book QA pairs. The manuscript supplies no description of the generation algorithm (e.g., extraction rules, LLM prompts, or post-processing), no human/expert validation protocol, no inter-annotator agreement or error-rate statistics, and no public sample of items. This absence makes the accuracy/unbiasedness assumption unverifiable and load-bearing for the benchmark's utility.
  Authors: We agree that the current manuscript does not provide sufficient detail on the QA generation pipeline, validation, or supporting statistics, which limits the ability to verify the benchmark's quality. This was an omission in the initial draft. In the revised version, we will expand the Benchmark Construction section to include: (1) a full description of the generation algorithm with extraction rules, LLM prompts, and post-processing steps; (2) the human/expert validation protocol; (3) inter-annotator agreement and error-rate statistics; and (4) a public sample of benchmark items (with the full dataset to be released upon acceptance). These additions will make the accuracy and unbiasedness claims verifiable.
  revision: yes
- Referee: [Evaluation and results] Evaluation and results sections: No model evaluations, baseline scores, or error analyses are reported. Without at least preliminary closed-book results on representative LLMs (or an explicit statement that the paper is a benchmark-only release), it is impossible to confirm that BAGEL functions as a usable testbed or to characterize the 'systematic failure modes' claimed in the abstract.
  Authors: The manuscript's primary contribution is the introduction of the benchmark itself, with the abstract's reference to characterizing failure modes being prospective based on the benchmark's design. However, we acknowledge that the absence of any evaluations makes it difficult to confirm usability. In the revised manuscript, we will add a new Evaluation section with preliminary closed-book results on a set of representative LLMs (e.g., GPT-4, Llama-3, and smaller models), including baseline accuracy scores across categories and an initial error analysis. This will demonstrate that BAGEL serves as a functional testbed and provide concrete examples of the systematic failure modes it can reveal.
  revision: yes
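The inter-annotator agreement statistic promised in the first response is not specified. Cohen's kappa is one common choice for two annotators, and the sketch below assumes it. The `usable`/`reject` labels are illustrative assumptions, not the authors' actual annotation scheme.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts from two annotators on four items.
annotator_a = ["usable", "usable", "reject", "reject"]
annotator_b = ["usable", "reject", "reject", "reject"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

In this toy case the annotators agree on three of four items (0.75 observed) against 0.5 expected by chance, so kappa is 0.5. Reporting kappa alongside raw agreement guards against inflated agreement when one label dominates.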
Circularity Check
No significant circularity identified
Full rationale
The paper presents BAGEL as a benchmark dataset constructed from external sources (bioRxiv, Global Biotic Interactions, Xeno-canto, Wikipedia) via curated examples and automated closed-book QA generation. No equations, parameter fits, predictions, or derivations are present that could reduce to the inputs by construction. The work is an evaluation resource rather than a result derived from prior claims or self-referential steps, making the construction self-contained without circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Closed-book question answering from curated scientific sources accurately measures a model's internalized animal knowledge without retrieval.
Reference graph
Works this paper leans on
- [1] Can the answer be solved from a single cue word or phrase?
- [2] Would the item still work the same if the species names were replaced with generic placeholders?
- [3] Is the correct option mostly a paraphrase of the stem?
- [4] Does the item require ecological reasoning rather than lexical matching? If the answer to any of 1-3 is yes, strongly prefer rejection. Avoid wording such as: "documented", "recorded", "reported", "observed", "in the record", "according to the source", "the passage states". Return valid JSON in one of these two formats. USABLE: { "usable": "true",...
- [5] the study system or biological setting,
- [6] the relevant manipulation, comparison, or observation,
- [7] the key evidence or result pattern,
- [8] the reasoning task. Avoid copying long sentences directly from the paper. Lightly rewrite the source information into a natural, self-contained scientific scenario. Do not make the stem so compressed that the correct answer becomes obvious from wording alone. Before finalizing, check: Can a model answer the question from the stem alone? Does th...