SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models
Pith reviewed 2026-05-20 06:40 UTC · model grok-4.3
The pith
SciCustom builds custom LLM benchmarks from scientific data by tagging knowledge units and retrieving relevant examples without expert input.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SciCustom organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity, trains a tagger to map data, identifies relevant units through voting-based multi-model consensus for custom requirements, and generates benchmarks via relevance-aware retrieval and proxy subset selection.
What carries the argument
Ontology-grounded knowledge units identified by voting-based multi-model consensus that support relevance-aware benchmark retrieval from large datasets.
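The voting step can be pictured with a minimal sketch. It assumes each model returns a set of knowledge-unit IDs it judges relevant and that a unit is kept when a strict majority of models select it. The function name `consensus_units`, the majority threshold, and the unit names are illustrative assumptions; the paper's actual voting rule is not specified in this summary.

```python
from collections import Counter


def consensus_units(votes_per_model, min_votes=None):
    """Return the knowledge units selected by at least `min_votes` models.

    votes_per_model: list of sets, one per model, holding the unit IDs that
    model judged relevant to the custom requirement.
    min_votes: defaults to a strict majority (an assumption, not the paper's rule).
    """
    if not votes_per_model:
        return set()
    if min_votes is None:
        min_votes = len(votes_per_model) // 2 + 1
    # Count each unit at most once per model.
    counts = Counter(unit for votes in votes_per_model for unit in set(votes))
    return {unit for unit, n in counts.items() if n >= min_votes}


# Hypothetical votes from three models for one custom requirement.
votes = [
    {"kinase_inhibition", "histone_acetylation"},
    {"histone_acetylation", "drug_toxicity"},
    {"histone_acetylation", "kinase_inhibition"},
]
selected = consensus_units(votes)
print(sorted(selected))
```

Raising `min_votes` toward the number of models trades recall of relevant units for stricter agreement.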
If this is right
- Standard benchmarks overlook fine-grained differences in LLM scientific capabilities.
- Custom benchmarks can be built without expert annotation or synthetic question generation.
- The framework provides a scalable foundation for application-aware evaluation in science domains.
- Experiments confirm utility in chemistry and healthcare domains.
Where Pith is reading between the lines
- Similar tagging and consensus methods could adapt to non-scientific domains like law or engineering for custom evaluations.
- Updating the knowledge units as new scientific data emerges might keep benchmarks current over time.
- Integration with existing large datasets could reduce the cost of repeated evaluations for different requirements.
Load-bearing premise
Ontology-grounded knowledge units with controlled granularity accurately represent fine-grained scientific capabilities, and the tagger with voting consensus reliably identifies relevant units for custom needs.
What would settle it
A test in a new scientific field in which SciCustom-generated benchmarks either fail to differentiate LLMs more finely than standard benchmarks do, or require as much expert effort to create as those benchmarks.
Original abstract
Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at https://github.com/yjwtheonly/SciCustom.
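The abstract does not say what the binary search runs over. One plausible reading, sketched below purely as an assumption, is a search over a relevance threshold: because the number of retained instances can only shrink as the threshold rises, binary search can find the strictest threshold that still yields a benchmark of a target size. The function `retrieve_by_threshold`, the score dictionary, and the instance IDs are hypothetical.

```python
def retrieve_by_threshold(scores, target_size):
    """Find the highest relevance threshold keeping >= target_size instances.

    scores: dict mapping instance ID -> relevance score (higher is more relevant).
    Returns (threshold, sorted retained IDs). If even the loosest threshold
    keeps fewer than target_size instances, everything is returned.
    """
    thresholds = sorted(set(scores.values()))
    if not thresholds:
        return None, []

    def kept(threshold):
        return [i for i, s in scores.items() if s >= threshold]

    lo, hi = 0, len(thresholds) - 1
    best = 0  # fall back to the loosest threshold
    while lo <= hi:
        mid = (lo + hi) // 2
        if len(kept(thresholds[mid])) >= target_size:
            best = mid      # feasible: try a stricter threshold
            lo = mid + 1
        else:
            hi = mid - 1    # too strict: relax
    threshold = thresholds[best]
    return threshold, sorted(kept(threshold))


# Hypothetical relevance scores for five data instances.
scores = {"q1": 0.9, "q2": 0.4, "q3": 0.75, "q4": 0.1, "q5": 0.75}
threshold, retained = retrieve_by_threshold(scores, target_size=3)
print(threshold, retained)
```

The search is valid only because the retained count is monotone in the threshold; a non-monotone relevance criterion would need a different strategy.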
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SciCustom, a framework for constructing custom benchmarks to evaluate fine-grained scientific capabilities in LLMs. Scientific knowledge is organized into ontology-grounded units of controlled granularity; a tagger is trained to map large-scale data instances into this space; given a custom requirement, relevant units are selected via voting-based multi-model consensus; benchmarks are then retrieved via binary search, with proxy subset selection and data-grounded generation. Experiments in chemistry and healthcare are claimed to show that SciCustom uncovers LLM capability differences overlooked by standard benchmarks, without expert annotation or synthetic question generation. Source code is released.
Significance. If the framework and its experimental outcomes hold after validation, the work would provide a scalable, data-driven alternative to manually curated or generic benchmarks for scientific LLM evaluation. This could improve alignment with real-world scientific use cases in domains such as chemistry and healthcare. The open-source release supports reproducibility and extension.
major comments (1)
- [Experiments] Experiments section: The central claim that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook depends on the reliability of the ontology-grounded units, the trained tagger, and the voting-based multi-model consensus. No precision/recall figures for the tagger on held-out scientific instances, nor inter-rater agreement metrics (e.g., Cohen’s kappa) between consensus-selected units and domain-expert selections, are reported for the chemistry or healthcare experiments. Without these, it is impossible to exclude that observed differences arise from biases in the multi-model voting step rather than genuine capability distinctions.
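For reference, the agreement metric named in this comment can be computed as follows. This is a minimal sketch of Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), applied to binary "unit selected / not selected" labels; the label vectors are made-up illustrations, not data from the paper.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two raters' labels over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both raters used one identical category; kappa is formally
        # undefined here, and this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical selections over eight candidate units (1 = selected).
consensus = [1, 1, 0, 0, 1, 0, 1, 0]
expert = [1, 0, 0, 0, 1, 0, 1, 1]
print(cohens_kappa(consensus, expert))
```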
minor comments (2)
- [Abstract] Abstract: The description of 'proxy subset selection and data-grounded benchmark generation' lacks any indication of the selection criteria, algorithms, or efficiency metrics employed; adding a brief outline would improve clarity for readers.
- [Framework description] The manuscript would benefit from an explicit statement of the ontology construction process and the training data used for the tagger, even at a high level, to allow assessment of generality.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the validation of key components in SciCustom. We address the major comment on experimental reliability below and outline planned revisions to strengthen the manuscript.
- Referee: The central claim that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook depends on the reliability of the ontology-grounded units, the trained tagger, and the voting-based multi-model consensus. No precision/recall figures for the tagger on held-out scientific instances, nor inter-rater agreement metrics (e.g., Cohen’s kappa) between consensus-selected units and domain-expert selections, are reported for the chemistry or healthcare experiments. Without these, it is impossible to exclude that observed differences arise from biases in the multi-model voting step rather than genuine capability distinctions.
  Authors: We agree that quantitative validation of the tagger and consensus mechanism is important for supporting the central claim. The current manuscript prioritizes the end-to-end framework and its ability to surface capability gaps relative to standard benchmarks, but does not include held-out precision/recall for the tagger or expert agreement metrics for the voting-based selection. In the revised manuscript we will add a dedicated validation subsection reporting precision and recall for the tagger on held-out chemistry and healthcare instances. We will also include results from a small-scale expert study in which domain specialists independently annotated relevant units for a sample of custom requirements; Cohen’s kappa will be reported between these expert selections and the multi-model consensus outputs. These additions will directly address the possibility of voting bias and provide clearer evidence that observed LLM differences reflect genuine capability distinctions.
  Revision: yes
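The tagger validation described in this response could be reported along the following lines. The sketch assumes the tagger is multi-label (each instance maps to a set of knowledge units) and uses micro-averaging; both are assumptions, and the unit IDs are placeholders rather than anything from the paper.

```python
def micro_precision_recall(predicted, gold):
    """Micro-averaged precision and recall for multi-label tagging.

    predicted, gold: lists of sets of unit IDs, aligned by instance.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be aligned by instance")
    tp = sum(len(p & g) for p, g in zip(predicted, gold))  # correct tags
    fp = sum(len(p - g) for p, g in zip(predicted, gold))  # spurious tags
    fn = sum(len(g - p) for p, g in zip(predicted, gold))  # missed tags
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical held-out instances with placeholder unit IDs.
predicted = [{"a", "b"}, {"c"}, set()]
gold = [{"a"}, {"c", "d"}, {"e"}]
precision, recall = micro_precision_recall(predicted, gold)
print(precision, recall)
```

Micro-averaging weights frequent units more heavily; per-unit (macro) figures would show whether rare units are tagged reliably.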
Circularity Check
No circularity: constructive framework with independent components
The paper describes SciCustom as a constructive pipeline: ontology-grounded knowledge units are organized from scientific data, a tagger is trained to map instances, and voting-based multi-model consensus identifies relevant units for custom requirements, followed by retrieval and benchmark generation. No equations, predictions, or first-principles results are presented that reduce by construction to fitted inputs or self-citations. The central claims rest on the described method's ability to surface fine-grained capabilities without expert annotation, and experiments in chemistry/healthcare are presented as empirical demonstrations rather than tautological outputs. This matches the default expectation of a non-circular framework description.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Scientific knowledge can be organized into ontology-grounded knowledge units with controlled granularity.
- domain assumption: A tagger can be trained to accurately map large-scale data instances into the knowledge space.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: SCICUSTOM first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We select concepts from this ontology as knowledge units... depth-first traversal over each ontology DAG Gi... LLM classifies v as coarse, moderate, or fine
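The quoted passage is elided, so the exact selection rule is not recoverable from it. As a rough illustration only, the sketch below walks an ontology DAG depth-first, asks a classifier for each node's granularity, descends through "coarse" nodes, keeps "moderate" nodes as knowledge units, and skips "fine" nodes. That rule, the function `select_units`, the stand-in lookup table used in place of an LLM, and the concept names are all assumptions.

```python
def select_units(dag, roots, classify):
    """Depth-first traversal that keeps 'moderate' concepts as knowledge units.

    dag: dict mapping a concept to its list of child concepts.
    roots: list of root concepts to start from.
    classify: callable returning 'coarse', 'moderate', or 'fine' for a concept
              (a stand-in for the LLM classifier mentioned in the passage).
    """
    selected, visited = [], set()
    stack = list(reversed(roots))
    while stack:
        node = stack.pop()
        if node in visited:  # a DAG node can be reached via several parents
            continue
        visited.add(node)
        level = classify(node)
        if level == "moderate":
            selected.append(node)  # keep it and do not descend further
        elif level == "coarse":
            stack.extend(reversed(dag.get(node, [])))  # too broad: go deeper
        # 'fine' nodes are neither kept nor expanded in this sketch
    return selected


# Hypothetical ontology fragment and granularity labels.
dag = {
    "chemistry": ["organic_chemistry", "biochemistry"],
    "organic_chemistry": ["reaction_mechanisms"],
    "biochemistry": ["reaction_mechanisms", "enzyme_kinetics", "km_of_hexokinase"],
}
labels = {
    "chemistry": "coarse",
    "organic_chemistry": "coarse",
    "biochemistry": "coarse",
    "reaction_mechanisms": "moderate",
    "enzyme_kinetics": "moderate",
    "km_of_hexokinase": "fine",
}
units = select_units(dag, ["chemistry"], labels.get)
print(units)
```

The `visited` set matters because the structure is a DAG rather than a tree: a concept with two parents would otherwise be classified and selected twice.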
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.