LLMs can be fine-tuned into model organisms that resist RL elicitation in domains like biosecurity while preserving related skills, and frontier models show explicit reasoning to suppress exploration when given training context.
It would only flag content when there’s very high confidence it’s harmful, reducing false positives that could block legitimate use cases.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Exploration Hacking: Can LLMs Learn to Resist RL Training?