OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets
Pith reviewed 2026-07-03 14:42 UTC · model grok-4.3
The pith
Safe completion must be measured as intent-calibrated responses over matched task variants rather than average safety across separate prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OpenSafeIntent provides controlled prompt-sets with benign, dual-use, and malicious variants of the same task, revealing that models often fail to remain safe across matched intent variants, dual-use behavior is brittle under paraphrase, high-level answers on risky topics are not reliably safe, and responses that reframe ambiguous requests into safer tasks are substantially less likely to cross the safety boundary. This shows safe completion should be evaluated as intent-calibrated behavior over controlled task variants, not as a single safety-helpfulness tradeoff over independent prompts.
What carries the argument
The OpenSafeIntent benchmark of datapoints each containing benign, dual-use, and malicious variants of one fixed task, used to measure whether models calibrate assistance according to expressed intent.
If this is right
- Models that appear safe on average across independent prompts often fail to calibrate assistance when intent changes for the same task.
- Dual-use behavior in models is brittle and changes with paraphrase of the prompt.
- High-level answers on risky topics do not reliably stay within safe boundaries.
- Reframing ambiguous requests into safer tasks makes responses substantially less likely to violate safety rules.
Where Pith is reading between the lines
- Safety training could target consistency across intent-matched variants instead of aggregate scores.
- Evaluation suites in other dual-use areas could adopt similar controlled variant designs to isolate intent effects.
- Real-world filtering systems might incorporate checks for calibration stability on matched prompt families.
Load-bearing premise
The prompt variants are equivalent in underlying task difficulty and content while differing only in expressed intent.
What would settle it
An experiment showing that models maintain consistent safety behavior across all three intent variants within each OpenSafeIntent set, or that observed differences trace to unintended changes in task content rather than intent.
Figures
read the original abstract
Safe completion requires models to provide useful assistance without enabling harm, but this behavior is difficult to evaluate with isolated prompts. We introduce OpenSafeIntent, a benchmark of controlled prompt-sets that vary intent while holding the underlying task fixed. Each datapoint contains benign, dual-use, and malicious variants of the same task. This design lets us evaluate whether models calibrate assistance across intent shifts, rather than merely appearing safe on average. Across a broad model suite, we find that prompt-level safety hides important failures: models often fail to remain safe across matched intent variants, dual-use behavior is brittle under paraphrase, high-level answers on risky topics are not reliably safe, and responses that reframe ambiguous requests into safer tasks are substantially less likely to cross the safety boundary. Our results suggest that safe completion should be evaluated as intent-calibrated behavior over controlled task variants, not as a single safety-helpfulness tradeoff over independent prompts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OpenSafeIntent, a benchmark consisting of controlled prompt-sets with benign, dual-use, and malicious variants of the same underlying task. This design is used to evaluate whether language models calibrate safe completion across intent shifts rather than exhibiting safety only on average. The authors report that models frequently fail to remain safe across matched variants, that dual-use behavior is brittle under paraphrase, that high-level answers on risky topics are not reliably safe, and that reframing responses are less likely to cross safety boundaries. They conclude that safe completion should be assessed as intent-calibrated behavior over controlled task variants instead of a single safety-helpfulness tradeoff over independent prompts.
Significance. If the task-equivalence assumption holds, the benchmark offers a more precise method for diagnosing safety calibration failures that standard aggregate evaluations miss. The controlled variant design and evaluation across a broad model suite are strengths that could influence future safety benchmarking practices.
major comments (2)
- [Benchmark construction] Benchmark construction (described in the abstract and implied methods): the claim that each datapoint's variants hold the underlying task fixed (same difficulty, content, required knowledge) while differing only in expressed intent is load-bearing for attributing observed refusal/reframing differences to intent calibration. No construction protocol, human validation metrics, or statistical equivalence checks are supplied, so task variation cannot be ruled out as a confound.
- [Results and discussion] Results and discussion sections: the reported failures (brittle dual-use, unsafe high-level answers) are interpreted as evidence of poor intent calibration, but this interpretation rests directly on unverified equivalence; without independent confirmation that variants are matched on difficulty and content, the central recommendation for intent-calibrated evaluation cannot be cleanly supported.
minor comments (1)
- The abstract states the design goal clearly but would benefit from one concrete example of a matched prompt triplet to illustrate how intent is varied while task content is held fixed.
Simulated Author's Rebuttal
We thank the referee for the careful review and for highlighting the centrality of the task-equivalence assumption. We address each major comment below and will revise the manuscript to strengthen the supporting evidence.
read point-by-point responses
-
Referee: [Benchmark construction] Benchmark construction (described in the abstract and implied methods): the claim that each datapoint's variants hold the underlying task fixed (same difficulty, content, required knowledge) while differing only in expressed intent is load-bearing for attributing observed refusal/reframing differences to intent calibration. No construction protocol, human validation metrics, or statistical equivalence checks are supplied, so task variation cannot be ruled out as a confound.
Authors: We agree that the manuscript currently lacks an explicit, self-contained construction protocol and associated validation metrics. In the revised version we will add a dedicated subsection in Methods that details the prompt-generation procedure, the operational criteria used to hold task difficulty, content, and required knowledge constant across intent variants, the human validation protocol (including annotator instructions and agreement statistics), and any quantitative equivalence checks performed on the final dataset. revision: yes
-
Referee: [Results and discussion] Results and discussion sections: the reported failures (brittle dual-use, unsafe high-level answers) are interpreted as evidence of poor intent calibration, but this interpretation rests directly on unverified equivalence; without independent confirmation that variants are matched on difficulty and content, the central recommendation for intent-calibrated evaluation cannot be cleanly supported.
Authors: This observation follows directly from the first comment. Once the expanded construction and validation details are included, we will revise the Results and Discussion sections to tie each reported failure explicitly to the verified equivalence evidence. We will also add a short limitations paragraph that acknowledges the assumption and any residual uncertainty that the new checks cannot eliminate. revision: yes
Circularity Check
Empirical benchmark introduction with no derivations or self-referential reductions
full rationale
The paper introduces OpenSafeIntent as an empirical benchmark consisting of controlled prompt variants. It contains no equations, fitted parameters, predictions, or derivation chains. The central design choice (holding task fixed while varying intent) is presented as an explicit methodological assumption rather than derived from prior results or self-citations. No load-bearing self-citation, ansatz smuggling, or renaming of known results occurs. The reported findings are observational comparisons across models on the new dataset and do not reduce to inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:2507.11878 , year=
Llms encode harmfulness and refusal separately , author=. arXiv preprint arXiv:2507.11878 , year=
-
[2]
arXiv preprint arXiv:2502.01042 , pages=
Internal activation as the polar star for steering unsafe llm behavior , author=. arXiv preprint arXiv:2502.01042 , pages=
-
[3]
arXiv preprint arXiv:2505.23556 , year=
Understanding refusal in language models with sparse autoencoders , author=. arXiv preprint arXiv:2505.23556 , year=
-
[4]
arXiv preprint arXiv:2603.05773 , year=
Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models , author=. arXiv preprint arXiv:2603.05773 , year=
-
[5]
arXiv preprint arXiv:2508.09224 , year=
From hard refusals to safe-completions: Toward output-centric safety training , author=. arXiv preprint arXiv:2508.09224 , year=
-
[6]
gpt-oss-120b & gpt-oss-20b Model Card
gpt-oss-120b & gpt-oss-20b model card , author=. arXiv preprint arXiv:2508.10925 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Deepseek-v3 technical report , author=. arXiv preprint arXiv:2412.19437 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
2025 , howpublished =
Meta , title =. 2025 , howpublished =
2025
-
[11]
2026 , howpublished =
Google , title =. 2026 , howpublished =
2026
-
[12]
2025 , howpublished =
Anthropic , title =. 2025 , howpublished =
2025
-
[13]
2026 , howpublished =
Anthropic , title =. 2026 , howpublished =
2026
-
[14]
2025 , howpublished =
Google DeepMind , title =. 2025 , howpublished =
2025
-
[15]
2026 , howpublished =
Google DeepMind , title =. 2026 , howpublished =
2026
-
[16]
2025 , publisher=
Qwen3-next: Towards ultimate training & inference efficiency , author=. 2025 , publisher=
2025
-
[17]
2025 , howpublished =
Mistral Small 3 , author=. 2025 , howpublished =
2025
-
[18]
, author=
Medium is the new large. , author=. 2025 , howpublished =
2025
-
[19]
Openai gpt-5 system card , author=. arXiv preprint arXiv:2601.03267 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
2026 , howpublished =
Introducing GPT‑5.4 , author=. 2026 , howpublished =
2026
-
[21]
Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Gemma: Open Models Based on Gemini Research and Technology
Gemma: Open models based on gemini research and technology , author=. arXiv preprint arXiv:2403.08295 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
arXiv preprint arXiv:2406.15513 , year=
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference , author=. arXiv preprint arXiv:2406.15513 , year=
-
[24]
Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) , pages=
Self-instruct: Aligning language models with self-generated instructions , author=. Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) , pages=
-
[25]
Exploring safety-utility trade-offs in personalized language models , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=
2025
-
[26]
Useless but Safe? Benchmarking Utility Recovery with User Intent Clarification in Multi-Turn Conversations , author=. arXiv preprint arXiv:2604.27093 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
Findings of the Association for Computational Linguistics: EACL 2026 , pages=
Safesearch: Do not trade safety for utility in llm search agents , author=. Findings of the Association for Computational Linguistics: EACL 2026 , pages=
2026
-
[28]
Utility engineering: Analyzing and controlling emergent value systems in AIs. arXiv , author=. arXiv preprint arXiv:2502.08640 , year=
-
[29]
International Conference on Learning Representations , volume=
Model editing as a robust and denoised variant of dpo: A case study on toxicity , author=. International Conference on Learning Representations , volume=
-
[30]
arXiv preprint arXiv:2402.05162 , year=
Assessing the brittleness of safety alignment via pruning and low-rank modifications , author=. arXiv preprint arXiv:2402.05162 , year=
-
[31]
Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context
Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context , author=. arXiv preprint arXiv:2601.17642 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[32]
arXiv preprint arXiv:2509.01909 , year=
Oyster-I: Beyond Refusal--Constructive Safety Alignment for Responsible Language Models , author=. arXiv preprint arXiv:2509.01909 , year=
-
[33]
arXiv preprint arXiv:2510.10452 , year=
Steering Over-refusals Towards Safety in Retrieval Augmented Generation , author=. arXiv preprint arXiv:2510.10452 , year=
-
[34]
Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025 , year=
Sosbench: Benchmarking safety alignment on scientific knowledge , author=. Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025 , year=
2025
-
[35]
arXiv preprint arXiv:2510.04320 , year=
Read the Scene, Not the Script: Outcome-Aware Safety for LLMs , author=. arXiv preprint arXiv:2510.04320 , year=
-
[36]
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Harmbench: A standardized evaluation framework for automated red teaming and robust refusal , author=. arXiv preprint arXiv:2402.04249 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[37]
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=
Xstest: A test suite for identifying exaggerated safety behaviours in large language models , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=
2024
-
[38]
arXiv preprint arXiv:2405.20947 , year=
Or-bench: An over-refusal benchmark for large language models , author=. arXiv preprint arXiv:2405.20947 , year=
-
[39]
Findings of the Association for Computational Linguistics: EACL 2024 , pages=
Do-not-answer: Evaluating safeguards in LLMs , author=. Findings of the Association for Computational Linguistics: EACL 2024 , pages=
2024
-
[40]
When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains , author=. arXiv preprint arXiv:2604.19001 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[41]
arXiv preprint arXiv:2601.22510 , year=
Shattered Compositionality: Counterintuitive Learning Dynamics of Transformers for Arithmetic , author=. arXiv preprint arXiv:2601.22510 , year=
-
[42]
Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking
Uppaal, Rheeya and Htut, Phu Mon and Bai, Min and Pappas, Nikolaos and Qi, Zheng and Swamy, Sandesh. Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking. Proceedings of the 19th Conference of the E uropean Chapter of the A ssociation for C omputational L inguistics (Volume 1: Long Papers). 2026. doi:10.18653/v1/2026.eacl-long.194
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.