Will releasing the weights of future large language models grant widespread access to pandemic agents? [Internet]
2 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
  Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that are distinct from benign capabilities, explaining emergent misalignment.
- Prioritizing High-Consequence Biological Capabilities in Evaluations of Artificial Intelligence Models
  AI model evaluations for biological capabilities should prioritize high-consequence risks like pandemics, informed by life sciences dual-use experience, and occur prior to deployment to enable biosafety measures.