SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond
Pith reviewed 2026-05-15 17:55 UTC · model grok-4.3
The pith
SafeSci pairs a 250k-sample benchmark that separates safety knowledge from risk with a 1.5M-sample training set, diagnosing safety vulnerabilities across 24 LLMs and reducing them through fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SafeSciBench and SafeSciTrain together enable systematic safety evaluation by separating safety knowledge from risk, applying objective metrics to avoid subjective bias, and covering 0.25 million evaluation samples plus 1.5 million training samples. When applied to 24 advanced LLMs, the benchmark reveals widespread vulnerabilities; fine-tuning on SafeSciTrain then significantly improves safety alignment. The authors conclude that determining whether a scientific question is safe requires specific context rather than universal labeling.
What carries the argument
SafeSciBench, a benchmark that distinguishes safety knowledge from risk using objective, deterministically answerable questions to cover broad scientific scopes without subjective scoring.
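To make this evaluation design concrete, here is a minimal sketch under assumed conventions (the item schema, field names, and `predict` hook are ours, not SafeSciBench's): because each item carries a single verifiable answer, grading reduces to exact match and keeps knowledge and risk scores separate, with no judge model in the loop.

```python
# Hypothetical sketch of objective scoring over deterministically
# answerable items; the schema is illustrative, not the paper's format.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    question: str
    answer: str   # single verifiable gold answer
    kind: str     # "knowledge" (factual recall) or "risk" (applied scenario)

def exact_match(prediction: str, gold: str) -> bool:
    """Normalize whitespace and case, then compare; no judge model involved."""
    return prediction.strip().lower() == gold.strip().lower()

def score(items: list[Item], predict: Callable[[str], str]) -> dict[str, float]:
    """Accuracy per item kind, keeping knowledge and risk scores separate."""
    buckets: dict[str, list[bool]] = {}
    for item in items:
        buckets.setdefault(item.kind, []).append(
            exact_match(predict(item.question), item.answer)
        )
    return {kind: sum(hits) / len(hits) for kind, hits in buckets.items()}
```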
If this is right
- 24 advanced LLMs display critical safety vulnerabilities when tested on scientific topics.
- Fine-tuning on SafeSciTrain produces measurable gains in safety alignment for the tested models.
- LLMs exhibit varying degrees of excessive refusal on safety-related scientific questions.
- Safety judgments for scientific content cannot be made by fixed categories and must incorporate context.
Where Pith is reading between the lines
- The same knowledge-versus-risk framing could be adapted to non-scientific domains such as law or medicine to reduce benchmark subjectivity elsewhere.
- Context-dependent safety labels might support more granular guardrail systems that allow safe technical discussion while blocking harmful applications.
- Large-scale synthetic datasets like SafeSciTrain could be used to study how safety training affects general capability retention over multiple fine-tuning rounds.
Load-bearing premise
The selected objective metrics, the knowledge-versus-risk distinction, and the scale of the two datasets together capture the full range of real-world scientific safety issues without introducing selection biases or new blind spots.
What would settle it
A controlled test of whether models fine-tuned on SafeSciTrain still produce unsafe answers on a fresh set of scientific queries drawn from actual laboratory incidents or regulatory cases not represented in the original 1.5M samples.
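A minimal sketch of that protocol, under stated assumptions (the `generate` and `is_unsafe` hooks are hypothetical stand-ins for a model call and an operationally defined unsafe-answer check): compare unsafe-answer rates of the base model and its fine-tuned counterpart on the held-out queries.

```python
# Hypothetical held-out safety test: compare a base model and its
# SafeSciTrain-fine-tuned counterpart on queries absent from training.
from typing import Callable

def unsafe_rate(queries: list[str],
                generate: Callable[[str], str],
                is_unsafe: Callable[[str, str], bool]) -> float:
    """Fraction of held-out queries answered unsafely."""
    flags = [is_unsafe(q, generate(q)) for q in queries]
    return sum(flags) / len(flags)

def settle_it(fresh_queries, base_generate, tuned_generate, is_unsafe):
    base = unsafe_rate(fresh_queries, base_generate, is_unsafe)
    tuned = unsafe_rate(fresh_queries, tuned_generate, is_unsafe)
    # If `tuned` stays near `base` on incident-derived queries, the reported
    # fine-tuning gains did not generalize beyond the training distribution.
    return {"base": base, "tuned": tuned, "delta": base - tuned}
```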
read the original abstract
The success of large language models (LLMs) in scientific domains has heightened safety concerns, prompting numerous benchmarks to evaluate their scientific safety. Existing benchmarks often suffer from limited risk coverage and a reliance on subjective evaluation. To address these problems, we introduce SafeSci, a comprehensive framework for safety evaluation and enhancement in scientific contexts. SafeSci comprises SafeSciBench, a multi-disciplinary benchmark with 0.25M samples, and SafeSciTrain, a large-scale dataset containing 1.5M samples for safety enhancement. SafeSciBench distinguishes between safety knowledge and risk to cover extensive scopes and employs objective metrics such as deterministically answerable questions to mitigate evaluation bias. We evaluate 24 advanced LLMs, revealing critical vulnerabilities in current models. We also observe that LLMs exhibit varying degrees of excessive refusal behaviors on safety-related issues. For safety enhancement, we demonstrate that fine-tuning on SafeSciTrain significantly enhances the safety alignment of models. Finally, we argue that knowledge is a double-edged sword, and determining the safety of a scientific question should depend on specific context, rather than universally categorizing it as safe or unsafe. Our work provides both a diagnostic tool and a practical resource for building safer scientific AI systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the SafeSci framework for LLM safety in scientific domains, consisting of SafeSciBench (a 0.25M-sample multi-disciplinary benchmark that distinguishes safety knowledge from risk using objective, deterministically answerable questions to reduce bias) and SafeSciTrain (a 1.5M-sample dataset for safety enhancement via fine-tuning). It reports evaluation results on 24 advanced LLMs showing critical vulnerabilities and varying excessive refusal behaviors, demonstrates that fine-tuning on SafeSciTrain improves safety alignment, and concludes that scientific safety should be assessed in specific contexts rather than via universal safe/unsafe labels.
Significance. If the benchmark construction and evaluation protocols hold, SafeSci would provide a substantial empirical resource for diagnosing and mitigating safety issues in scientific LLMs, with broader risk coverage than prior benchmarks and concrete evidence from 24-model evaluations plus fine-tuning gains. The scale of the datasets and the emphasis on objective metrics represent practical strengths for reproducibility and community use in building safer scientific AI systems.
major comments (2)
- Abstract: The central claim that SafeSciBench enables comprehensive evaluation via fixed safety labels and objective metrics sits in tension with the paper's own conclusion that safety 'should depend on specific context, rather than universally categorizing it as safe or unsafe.' This risks systematic misclassification for context-sensitive queries (e.g., a chemistry synthesis question that is safe under controlled lab conditions but risky otherwise), which directly undermines the reported 'critical vulnerabilities' findings and the assertion of reduced evaluation bias.
- Abstract: The distinction between safety knowledge and risk is presented as load-bearing for coverage and bias reduction, yet no details are given on how this distinction was operationalized across the 0.25M samples, how context was (or was not) encoded in the deterministic questions, or how the 1.5M SafeSciTrain samples were filtered to avoid introducing new blind spots; without these, the weakest assumption noted in the review cannot be evaluated.
minor comments (2)
- Abstract: The mention of 'excessive refusal behaviors' on safety-related issues is not quantified or exemplified; adding concrete metrics or examples in the main text would clarify the observation.
- Abstract: Dataset sizes are given as 0.25M and 1.5M but without breakdown by discipline or risk category; a table summarizing coverage would aid assessment of multi-disciplinary claims.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment point by point below, providing clarifications from the manuscript and indicating where revisions will be made.
read point-by-point responses
-
Referee: Abstract: The central claim that SafeSciBench enables comprehensive evaluation via fixed safety labels and objective metrics is placed in tension by the paper's own conclusion that safety 'should depend on specific context, rather than universally categorizing it as safe or unsafe.' This risks systematic misclassification for context-sensitive queries (e.g., a chemistry synthesis question that is safe under controlled lab conditions but risky otherwise), which directly undermines the reported 'critical vulnerabilities' findings and the assertion of reduced evaluation bias.
Authors: We acknowledge the referee's observation of an apparent tension. SafeSciBench constructs questions with embedded contextual details (e.g., laboratory conditions, equipment, and controls) that render the safety label deterministically objective under standard scientific protocols, allowing fixed labels for reproducible evaluation. The paper's conclusion refers to real-world deployment scenarios where additional unstated context may apply, but does not invalidate the benchmark's standardized assessment of model behavior on these objective cases. The vulnerability findings are based on consistent failures across these controlled queries. We will revise the abstract to explicitly distinguish the benchmark's objective scope from broader contextual considerations. revision: partial
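To illustrate the claim in this response (the example items are ours, not drawn from the benchmark), embedding the context in the question is what makes the label deterministic: the same topic can carry opposite labels once conditions are spelled out.

```python
# Illustrative only, not from SafeSciBench: the same topic yields opposite
# deterministic labels once the context is stated inside the item itself.
contextual_items = [
    {
        "topic": "solvent handling",
        "question": ("In a fume hood with nitrile gloves and secondary "
                     "containment, is transferring 50 mL of acetone between "
                     "labeled containers consistent with standard protocols?"),
        "label": "safe",    # fixed by the stated controls
    },
    {
        "topic": "solvent handling",
        "question": ("Is decanting several liters of acetone near an open "
                     "flame without ventilation consistent with standard "
                     "protocols?"),
        "label": "unsafe",  # fixed by the stated conditions
    },
]
```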
-
Referee: Abstract: The distinction between safety knowledge and risk is presented as load-bearing for coverage and bias reduction, yet no details are given on how this distinction was operationalized across the 0.25M samples, how context was (or was not) encoded in the deterministic questions, or how the 1.5M SafeSciTrain samples were filtered to avoid introducing new blind spots; without these, the weakest assumption noted in the review cannot be evaluated.
Authors: We agree that expanded details on operationalization will improve clarity. The full manuscript (Section 3) defines safety knowledge items as tests of factual recall regarding safety principles and regulations, while risk items involve applied scenarios with potential harm; context is encoded via explicit parameters in each question (e.g., 'under standard lab conditions with PPE') to support deterministic answers. SafeSciTrain samples were generated and filtered using rule-based consistency checks against established safety guidelines to minimize blind spots. We will add a dedicated subsection with question templates, examples, and filtering pseudocode to the methods section. revision: yes
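A sketch of what such rule-based consistency filtering could look like (the rules, sample fields, and helper names are illustrative assumptions, not the checks described in the paper's Section 3): each generated sample is kept only if its answer is consistent with a small guideline table.

```python
# Hypothetical rule-based filter for generated training samples; the rule
# set and sample fields are illustrative, not SafeSciTrain's actual checks.
GUIDELINES = {
    # hazard classes whose answers must mention protective equipment
    "requires_ppe": {"corrosive", "toxic", "flammable"},
}

def passes_checks(sample: dict) -> bool:
    """Keep a sample only if its answer is consistent with the guidelines."""
    hazards = set(sample.get("hazard_classes", []))
    answer = sample["answer"].lower()
    # Rule: hazardous-material answers must reference protective equipment.
    if hazards & GUIDELINES["requires_ppe"] and "ppe" not in answer:
        return False
    # Rule: risk-scenario samples must not give actionable procedural steps.
    if sample.get("kind") == "risk" and "step 1" in answer:
        return False
    return True

def filter_samples(raw_samples: list[dict]) -> list[dict]:
    """Apply all consistency rules to a generated candidate pool."""
    return [s for s in raw_samples if passes_checks(s)]
```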
Circularity Check
Empirical benchmark construction with no derivation chain or self-referential reductions
full rationale
The paper presents SafeSci as a framework consisting of SafeSciBench (0.25M samples) and SafeSciTrain (1.5M samples) for LLM safety evaluation and fine-tuning in scientific domains. All central claims rest on the construction of these datasets, objective metrics for deterministically answerable questions, and empirical results from evaluating 24 LLMs plus fine-tuning experiments. No mathematical derivations, predictions, or equations are present that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The distinction between safety knowledge and risk is an author-defined categorization used to build the benchmark; this is standard dataset design and does not create circularity. The final qualitative argument that safety depends on context is an interpretive conclusion, not a load-bearing prediction derived from the benchmark labels. The work is self-contained as empirical resource creation and evaluation.