SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

Beier Xiao; Bin Feng; Bohan Wu; Haoran Li; Junwei Yang; Junyu Luo; Kaili Liu; Ming Zhang; Philip S. Yu; Qi Shi

arxiv: 2605.19357 · v1 · pith:ZKRHLRWRnew · submitted 2026-05-19 · 💻 cs.CL

Yiyang Gu, Junwei Yang, Junyu Luo, Ye Yuan, Bin Feng, Yingce Xia, Shufang Xie, Kaili Liu, Bohan Wu, Qi Shi, Haoran Li, Beier Xiao, Zhiping Xiao, Xiao Luo, Weizhi Zhang, Philip S. Yu, Zequn Liu, Ming Zhang

Pith reviewed 2026-05-20 06:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM evaluation · scientific benchmarks · custom evaluation · knowledge units · ontology · chemistry · healthcare · LLM capabilities

The pith

SciCustom builds custom LLM benchmarks from scientific data by tagging knowledge units and retrieving relevant examples without expert input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SciCustom as a way to evaluate large language models on specific scientific tasks by organizing knowledge into controlled ontology-based units. A tagger is trained to label data instances, and multiple models vote to select units matching a given custom requirement. This allows efficient retrieval of relevant data to generate benchmarks. Tests in chemistry and healthcare show it uncovers capability differences that generic benchmarks miss. The approach scales by using existing data rather than creating new questions manually.

Core claim

SciCustom organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity, trains a tagger to map data, identifies relevant units through voting-based multi-model consensus for custom requirements, and generates benchmarks via relevance-aware retrieval and proxy subset selection.

What carries the argument

Ontology-grounded knowledge units identified by voting-based multi-model consensus that support relevance-aware benchmark retrieval from large datasets.
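
The voting step can be illustrated with a minimal sketch. It assumes each model returns a set of knowledge-unit names for a requirement and that a unit is kept when at least `min_votes` models select it. The function name `consensus_units`, the threshold rule, and the example units are hypothetical and are not taken from the paper or its code.

```python
from collections import Counter
from typing import Iterable


def consensus_units(votes: Iterable[set[str]], min_votes: int) -> list[str]:
    """Return units picked by at least `min_votes` models.

    Results are ordered by descending support, then alphabetically.
    """
    counts: Counter[str] = Counter()
    for selected in votes:
        # A set guarantees each model contributes at most one vote per unit.
        counts.update(set(selected))
    kept = [(unit, n) for unit, n in counts.items() if n >= min_votes]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [unit for unit, _ in kept]


# Hypothetical selections from three models for one custom requirement.
votes = [
    {"cycloaddition", "sigmatropic rearrangement", "electrocyclic reaction"},
    {"cycloaddition", "electrocyclic reaction", "radical reaction"},
    {"cycloaddition", "sigmatropic rearrangement"},
]
selected = consensus_units(votes, min_votes=2)
print(selected)
```

A majority threshold is only one possible aggregation rule; the paper may weight models or use a different cutoff.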

If this is right

Standard benchmarks overlook fine-grained differences in LLM scientific capabilities.
Custom benchmarks can be built without expert annotation or synthetic question generation.
The framework provides a scalable foundation for application-aware evaluation in science domains.
Experiments confirm utility in chemistry and healthcare domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar tagging and consensus methods could adapt to non-scientific domains like law or engineering for custom evaluations.
Updating the knowledge units as new scientific data emerges might keep benchmarks current over time.
Integration with existing large datasets could reduce the cost of repeated evaluations for different requirements.

Load-bearing premise

Ontology-grounded knowledge units with controlled granularity accurately represent fine-grained scientific capabilities, and the tagger with voting consensus reliably identifies relevant units for custom needs.

What would settle it

A test in a new scientific field where SciCustom-generated benchmarks fail to show more differentiation between LLMs than standard benchmarks or require comparable expert effort to create.

Figures

Figures reproduced from arXiv: 2605.19357 by Beier Xiao, Bin Feng, Bohan Wu, Haoran Li, Junwei Yang, Junyu Luo, Kaili Liu, Ming Zhang, Philip S. Yu, Qi Shi, Shufang Xie, Weizhi Zhang, Xiao Luo, Ye Yuan, Yingce Xia, Yiyang Gu, Zequn Liu, Zhiping Xiao.

**Figure 1.** Illustrations of SCICUSTOM. (a) Comparison between traditional off-the-shelf benchmarking and our ontology-driven framework. (b, c) Evaluation of 10 LLMs on different benchmarks, where each dot represents a model. Targeting specific capabilities in Technical Chemistry, (b) the general scientific benchmark (GPQA Diamond) aligns poorly with expert ground truth, whereas (c) the benchmark constructed by SCI…

**Figure 2.** Framework of SCICUSTOM. It consists of an offline phase where scientific data is indexed into ontology-grounded knowledge units via a trained tagger, and an online phase where user requirements are parsed by multi-model voting to identify relevant tags. These tags guide the binary search-based selection and proxy selection of data. The problem set of the benchmark is generated based on these data.

**Figure 3.** Illustration of the synthetic data construction

**Figure 4.** Bar plot showing the effectiveness of relevance cutoff and subset selection strategies.

**Figure 5.** Case study on constructing a benchmark for “Pericyclic Reaction”. The pipeline progresses from…

Abstract

Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at https://github.com/yjwtheonly/SciCustom.
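
The abstract mentions retrieval "via binary search" without stating what is searched. One plausible reading, offered only as an illustration, is a search for a relevance cutoff in a ranked list of instances. The sketch below assumes the instances are already sorted from most to least relevant and that a hypothetical judge `is_relevant` is true for a prefix of the ranking and false afterwards. Both assumptions are mine, not the paper's.

```python
from typing import Callable, Sequence


def relevance_cutoff(ranked: Sequence[str],
                     is_relevant: Callable[[str], bool]) -> int:
    """Return how many leading items of `ranked` are judged relevant.

    Assumes `is_relevant` is True for a prefix of the ranking and False
    for everything after it, so O(log n) judge calls suffice.
    """
    lo, hi = 0, len(ranked)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_relevant(ranked[mid]):
            lo = mid + 1  # the cutoff lies strictly after `mid`
        else:
            hi = mid      # the cutoff lies at or before `mid`
    return lo


# Hypothetical ranked instance ids; the first five are treated as relevant.
ranked = ["q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8"]
relevant_ids = {"q1", "q2", "q3", "q4", "q5"}
cutoff = relevance_cutoff(ranked, lambda q: q in relevant_ids)
print(ranked[:cutoff])
```

If the judge is noisy and the prefix assumption fails, the returned cutoff is only approximate, which is one reason a real pipeline might verify a sample of instances near the boundary.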

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SciCustom gives a practical pipeline for custom scientific LLM benchmarks from real data but skips validation on the tagger and consensus steps.

Dear colleague,

SciCustom is a framework for building custom benchmarks to evaluate LLMs on specific scientific tasks by leveraging large-scale data and ontologies instead of manual curation. The new part is the end-to-end process: ontology-grounded knowledge units, a trained tagger for mapping data, voting-based consensus to select relevant units for a custom need, binary search for retrieval, and proxy selection for benchmark generation. This combination aims to make evaluation scalable and aligned with actual use cases. The experiments in chemistry and healthcare are presented as evidence that it can detect fine-grained differences overlooked by standard benchmarks, all while avoiding expert annotation and synthetic question generation. Releasing the code supports reproducibility.

The soft spots center on missing validation. The paper does not report accuracy metrics for the tagger or agreement scores between the consensus selections and expert choices. This leaves open the possibility that any observed differences stem from biases in the voting models rather than true capability variations. Details on exact metrics, baselines, and error analysis are also thin in the description.

The work is for researchers developing or applying LLM evaluations in scientific domains. Readers interested in practical tools for domain-specific testing would find the pipeline and code useful to explore.

I would recommend sending it for peer review. The core idea addresses a genuine need and the public code allows for further checking, though the experimental support requires more rigor to be fully convincing.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces SciCustom, a framework for constructing custom benchmarks to evaluate fine-grained scientific capabilities in LLMs. Scientific knowledge is organized into ontology-grounded units of controlled granularity; a tagger is trained to map large-scale data instances into this space; given a custom requirement, relevant units are selected via voting-based multi-model consensus; benchmarks are then retrieved via binary search, with proxy subset selection and data-grounded generation. Experiments in chemistry and healthcare are claimed to show that SciCustom uncovers LLM capability differences overlooked by standard benchmarks, without expert annotation or synthetic question generation. Source code is released.

Significance. If the framework and its experimental outcomes hold after validation, the work would provide a scalable, data-driven alternative to manually curated or generic benchmarks for scientific LLM evaluation. This could improve alignment with real-world scientific use cases in domains such as chemistry and healthcare. The open-source release supports reproducibility and extension.

major comments (1)

[Experiments] Experiments section: The central claim that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook depends on the reliability of the ontology-grounded units, the trained tagger, and the voting-based multi-model consensus. No precision/recall figures for the tagger on held-out scientific instances, nor inter-rater agreement metrics (e.g., Cohen’s kappa) between consensus-selected units and domain-expert selections, are reported for the chemistry or healthcare experiments. Without these, it is impossible to exclude that observed differences arise from biases in the multi-model voting step rather than genuine capability distinctions.
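
The two validation measures requested here are standard and easy to compute. The sketch below shows set-based precision and recall for predicted tags against gold tags, and Cohen's kappa for two raters labelling the same items. The function names and all example data are hypothetical; none of the numbers come from the paper.

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Set-based precision and recall of predicted tags against gold tags."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if p_e == 1.0:
        # Degenerate case: both raters used one identical label throughout.
        # Returning 1.0 is a convention chosen here; some tools return NaN.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical tagger output versus gold tags for one instance.
precision, recall = precision_recall({"A", "B", "C"}, {"B", "C", "D", "E"})

# Hypothetical binary unit selections (1 = selected) over eight candidate units.
consensus = [1, 1, 0, 0, 1, 0, 1, 0]
expert = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(consensus, expert)
print(precision, recall, kappa)
```

In this toy example the raters agree on 6 of 8 units and chance agreement is 0.5, so kappa is 0.5, which shows how a seemingly high raw agreement of 75% shrinks once chance is accounted for.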

minor comments (2)

[Abstract] Abstract: The description of 'proxy subset selection and data-grounded benchmark generation' lacks any indication of the selection criteria, algorithms, or efficiency metrics employed; adding a brief outline would improve clarity for readers.
[Framework description] The manuscript would benefit from an explicit statement of the ontology construction process and the training data used for the tagger, even at a high level, to allow assessment of generality.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on the validation of key components in SciCustom. We address the major comment on experimental reliability below and outline planned revisions to strengthen the manuscript.

Referee: The central claim that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook depends on the reliability of the ontology-grounded units, the trained tagger, and the voting-based multi-model consensus. No precision/recall figures for the tagger on held-out scientific instances, nor inter-rater agreement metrics (e.g., Cohen’s kappa) between consensus-selected units and domain-expert selections, are reported for the chemistry or healthcare experiments. Without these, it is impossible to exclude that observed differences arise from biases in the multi-model voting step rather than genuine capability distinctions.

Authors: We agree that quantitative validation of the tagger and consensus mechanism is important for supporting the central claim. The current manuscript prioritizes the end-to-end framework and its ability to surface capability gaps relative to standard benchmarks, but does not include held-out precision/recall for the tagger or expert agreement metrics for the voting-based selection. In the revised manuscript we will add a dedicated validation subsection reporting precision and recall for the tagger on held-out chemistry and healthcare instances. We will also include results from a small-scale expert study in which domain specialists independently annotated relevant units for a sample of custom requirements; Cohen’s kappa will be reported between these expert selections and the multi-model consensus outputs. These additions will directly address the possibility of voting bias and provide clearer evidence that observed LLM differences reflect genuine capability distinctions.

revision: yes

Circularity Check

0 steps flagged

No circularity: constructive framework with independent components

The paper describes SciCustom as a constructive pipeline: ontology-grounded knowledge units are organized from scientific data, a tagger is trained to map instances, and voting-based multi-model consensus identifies relevant units for custom requirements, followed by retrieval and benchmark generation. No equations, predictions, or first-principles results are presented that reduce by construction to fitted inputs or self-citations. The central claims rest on the described method's ability to surface fine-grained capabilities without expert annotation, and experiments in chemistry/healthcare are presented as empirical demonstrations rather than tautological outputs. This matches the default expectation of a non-circular framework description.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions about knowledge organization and automated mapping rather than new mathematical derivations or fitted parameters.

axioms (2)

domain assumption: Scientific knowledge can be organized into ontology-grounded knowledge units with controlled granularity.
Invoked as the starting point for organizing data and enabling custom selection.
domain assumption: A tagger can be trained to accurately map large-scale data instances into the knowledge space.
Required for the mapping step before consensus and retrieval.

pith-pipeline@v0.9.0 · 5794 in / 1261 out tokens · 44305 ms · 2026-05-20T06:40:59.755420+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

SCICUSTOM first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We select concepts from this ontology as knowledge units... depth-first traversal over each ontology DAG Gi... LLM classifies v as coarse, moderate, or fine
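
The quoted passage is truncated, so the exact selection rule is not visible here. As a rough sketch under stated assumptions, a depth-first traversal could descend through nodes labelled coarse, keep nodes labelled moderate as knowledge units without descending further, and skip nodes labelled fine. That rule, the `classify` callback standing in for the LLM, and the toy ontology are all hypothetical.

```python
from typing import Callable


def select_units(dag: dict[str, list[str]],
                 roots: list[str],
                 classify: Callable[[str], str]) -> list[str]:
    """Depth-first selection of knowledge units from an ontology DAG.

    `dag` maps a concept to its children. `classify` stands in for the LLM
    and must return "coarse", "moderate", or "fine".
    """
    selected: list[str] = []
    seen: set[str] = set()
    stack = list(reversed(roots))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # a DAG node can be reached through several parents
        seen.add(node)
        label = classify(node)
        if label == "moderate":
            selected.append(node)  # keep it and do not descend further
        elif label == "coarse":
            # Too broad: push children so the first child is visited first.
            stack.extend(reversed(dag.get(node, [])))
        # "fine" nodes are skipped (an assumption of this sketch)
    return selected


# Toy ontology and labels, invented for illustration only.
dag = {
    "chemistry": ["organic chemistry", "physical chemistry"],
    "organic chemistry": ["pericyclic reaction", "substitution reaction"],
    "physical chemistry": ["pericyclic reaction", "chemical kinetics"],
    "pericyclic reaction": ["Diels-Alder reaction"],
}
labels = {
    "chemistry": "coarse",
    "organic chemistry": "coarse",
    "physical chemistry": "coarse",
    "pericyclic reaction": "moderate",
    "substitution reaction": "moderate",
    "chemical kinetics": "moderate",
    "Diels-Alder reaction": "fine",
}
units = select_units(dag, ["chemistry"], lambda v: labels[v])
print(units)
```

The `seen` set matters because the structure is a DAG rather than a tree: a concept with two parents is classified once, which would also avoid repeated LLM calls in a real pipeline.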

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

