OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

Junjie Hu; Rheeya Uppaal; Selina Sung; Seungwoo Lyu

arxiv: 2607.02047 · v1 · pith:DBYFOPG4new · submitted 2026-07-02 · 💻 cs.CL · cs.AI

Rheeya Uppaal, Seungwoo Lyu, Selina Sung, Junjie Hu

Pith reviewed 2026-07-03 14:42 UTC · model grok-4.3

keywords safe completion · intent calibration · dual-use prompts · AI safety evaluation · benchmark · language models

The pith

Safe completion must be measured as intent-calibrated responses over matched task variants rather than average safety across separate prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OpenSafeIntent, a benchmark built from prompt sets that hold the underlying task fixed while varying expressed intent across benign, dual-use, and malicious versions. This design tests whether models adjust their assistance based on intent rather than simply averaging safety and helpfulness scores over unrelated prompts. Standard evaluations using isolated prompts can hide cases where a model stays safe on one variant but fails on another. The authors apply the benchmark to multiple models and report that safety scores at the prompt level do not predict behavior across intent shifts, that dual-use responses change with small rewordings, and that high-level answers on risky topics often cross safety lines. If these observations hold, safety assessment changes from a single tradeoff metric to a check for consistent calibration over controlled variants.
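
To make the contrast between prompt-level and set-level scoring concrete, here is a minimal sketch. It assumes that a set counts as safe only when the responses to all three of its variants are judged safe; the summary above does not give the paper's exact definition, so treat `triplet_safety` as one plausible reading. The `TripletResult` type, the intent labels, and the toy judgments are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass

# Hypothetical intent labels; the paper's own field names are not known here.
INTENTS = ("benign", "dual_use", "malicious")


@dataclass(frozen=True)
class TripletResult:
    """Safety judgments for one fixed task across its three intent variants."""
    task_id: str
    safe: dict  # maps each intent label to True (safe) or False (unsafe)


def mean_safety(results):
    """Fraction of individual responses judged safe, pooling all variants."""
    flags = [r.safe[intent] for r in results for intent in INTENTS]
    return sum(flags) / len(flags)


def triplet_safety(results):
    """Fraction of tasks whose responses are safe on all three variants.

    Assumption: a set passes only if every variant passes.
    """
    passed = sum(all(r.safe[intent] for intent in INTENTS) for r in results)
    return passed / len(results)


# Toy judgments, invented for illustration only.
results = [
    TripletResult("t1", {"benign": True, "dual_use": True, "malicious": True}),
    TripletResult("t2", {"benign": True, "dual_use": False, "malicious": True}),
    TripletResult("t3", {"benign": True, "dual_use": True, "malicious": False}),
    TripletResult("t4", {"benign": True, "dual_use": True, "malicious": True}),
]

print(f"mean safety:    {mean_safety(results):.3f}")
print(f"triplet safety: {triplet_safety(results):.3f}")
```

Under this reading the set-level score can never exceed the pooled score, because a single unsafe variant fails the whole set while removing only one response from the pooled count.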

Core claim

OpenSafeIntent provides controlled prompt-sets with benign, dual-use, and malicious variants of the same task, revealing that models often fail to remain safe across matched intent variants, dual-use behavior is brittle under paraphrase, high-level answers on risky topics are not reliably safe, and responses that reframe ambiguous requests into safer tasks are substantially less likely to cross the safety boundary. This shows safe completion should be evaluated as intent-calibrated behavior over controlled task variants, not as a single safety-helpfulness tradeoff over independent prompts.

What carries the argument

The OpenSafeIntent benchmark of datapoints each containing benign, dual-use, and malicious variants of one fixed task, used to measure whether models calibrate assistance according to expressed intent.

If this is right

Models that appear safe on average across independent prompts often fail to calibrate assistance when intent changes for the same task.
Dual-use behavior in models is brittle and changes with paraphrase of the prompt.
High-level answers on risky topics do not reliably stay within safe boundaries.
Reframing ambiguous requests into safer tasks makes responses substantially less likely to violate safety rules.
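
The last two points compare how often responses of a given kind end up unsafe. A rough sketch of that comparison follows, assuming each response has already been labeled with an assistance mode and a binary unsafe flag. The mode names (`high_level`, `reframe`, `direct`) and the records are hypothetical placeholders, not the paper's taxonomy or data.

```python
from collections import defaultdict


def conditional_unsafe_rate(records):
    """Return {mode: share of responses in that mode judged unsafe}.

    `records` is an iterable of (mode, is_unsafe) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # mode -> [unsafe count, total count]
    for mode, is_unsafe in records:
        counts[mode][0] += int(is_unsafe)
        counts[mode][1] += 1
    return {mode: unsafe / total for mode, (unsafe, total) in counts.items()}


# Toy labels, invented for illustration; they say nothing about real models.
records = [
    ("high_level", True), ("high_level", False),
    ("high_level", True), ("high_level", False),
    ("reframe", False), ("reframe", False),
    ("reframe", False), ("reframe", True),
    ("direct", True), ("direct", True),
]

rates = conditional_unsafe_rate(records)
for mode, rate in sorted(rates.items()):
    print(f"{mode:>10}: {rate:.2f}")
```

Conditioning on the mode keeps a frequently used but risky response style from being hidden behind a good overall average.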

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety training could target consistency across intent-matched variants instead of aggregate scores.
Evaluation suites in other dual-use areas could adopt similar controlled variant designs to isolate intent effects.
Real-world filtering systems might incorporate checks for calibration stability on matched prompt families.
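
One way such a stability check could look is sketched below. It sorts each family of matched prompts (for example, paraphrases of one dual-use request) by whether the model's responses were all safe, all unsafe, or inconsistent. The `mixed` label and the function name are assumptions made for this sketch, and the sample judgments are invented.

```python
from collections import Counter


def classify_family(safe_flags):
    """Label one prompt family by the consistency of its safety judgments."""
    flags = list(safe_flags)
    if not flags:
        raise ValueError("a prompt family needs at least one judgment")
    if all(flags):
        return "all_safe"
    if not any(flags):
        return "all_unsafe"
    return "mixed"  # assumed label for families with inconsistent behavior


# Toy families: each list holds per-paraphrase safety judgments.
families = {
    "family_a": [True, True, True],
    "family_b": [True, False, True],
    "family_c": [False, False, False],
    "family_d": [False, True, True],
}

labels = {name: classify_family(flags) for name, flags in families.items()}
distribution = Counter(labels.values())
print(distribution)
```

The share of `mixed` families is the quantity of interest for calibration stability: it counts cases where rewording alone changes whether the response is safe.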

Load-bearing premise

The prompt variants are equivalent in underlying task difficulty and content while differing only in expressed intent.

What would settle it

An experiment showing that models maintain consistent safety behavior across all three intent variants within each OpenSafeIntent set, or that observed differences trace to unintended changes in task content rather than intent.

Figures

Figures reproduced from arXiv: 2607.02047 by Junjie Hu, Rheeya Uppaal, Selina Sung, Seungwoo Lyu.

**Figure 2.** Mean and Triplet Safety across models. Triplet Safety is consistently lower and exhibits larger separation

**Figure 3.** Stratified results by task type and harm domain, averaged across models. Left: Triplet Safety. Right:

**Figure 4.** Safety-related failure modes for dual-use prompts.

**Figure 5.** Distribution of dual-use paraphrase sets by safety behavior. Each set is classified as all safe, all unsafe,

**Figure 6.** Conditional unsafe rate by response assistance mode.

**Figure 7.** Stratified results by task type and harm domain. Left: Utility Range across dual-use paraphrases. Right:

Original abstract

Safe completion requires models to provide useful assistance without enabling harm, but this behavior is difficult to evaluate with isolated prompts. We introduce OpenSafeIntent, a benchmark of controlled prompt-sets that vary intent while holding the underlying task fixed. Each datapoint contains benign, dual-use, and malicious variants of the same task. This design lets us evaluate whether models calibrate assistance across intent shifts, rather than merely appearing safe on average. Across a broad model suite, we find that prompt-level safety hides important failures: models often fail to remain safe across matched intent variants, dual-use behavior is brittle under paraphrase, high-level answers on risky topics are not reliably safe, and responses that reframe ambiguous requests into safer tasks are substantially less likely to cross the safety boundary. Our results suggest that safe completion should be evaluated as intent-calibrated behavior over controlled task variants, not as a single safety-helpfulness tradeoff over independent prompts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The core idea of testing safety via matched intent variants on fixed tasks is worth pursuing, but the paper provides no evidence that the variants actually hold task difficulty and content constant.

This paper introduces OpenSafeIntent, a benchmark built from prompt sets that keep the underlying task the same while shifting expressed intent from benign to dual-use to malicious. The claim is that current safety tests on isolated prompts miss calibration failures, and the reported results (brittle dual-use behavior, unsafe high-level answers, and reframing that avoids the safety boundary) support that point in principle.

What the work does is lay out a controlled design that directly targets intent calibration rather than average safety scores. That is a clear step beyond the independent-prompt benchmarks referenced in the abstract, and the reported patterns across a model suite give a concrete illustration of why the distinction matters.

The soft spot is the unverified equivalence assumption. The abstract states that each datapoint holds the task fixed while varying only intent, but supplies no construction protocol, human validation scores, or statistical checks on length, specificity, or required knowledge across variants. If malicious versions systematically add detail or complexity, then differences in refusal or reframing rates could trace to task variation instead of intent handling. Without those checks, the attribution of failures to safety calibration remains open.
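
As an illustration of the kind of check this paragraph asks for, the sketch below bootstraps an interval for the average word-count gap between two variants of the same task. It is not the authors' protocol: the function name, the choice of word count as the proxy, and the percentile bootstrap are all assumptions, and the prompts are neutral placeholder strings rather than benchmark items.

```python
import random
from statistics import mean


def word_count(text):
    """Crude length proxy: number of whitespace-separated tokens."""
    return len(text.split())


def paired_length_gap(triplets, a="benign", b="malicious", n_boot=2000, seed=0):
    """Mean word-count difference (b minus a) with a 95% percentile bootstrap interval.

    Differences are paired within each task, so only within-task variation counts.
    """
    diffs = [word_count(t[b]) - word_count(t[a]) for t in triplets]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    boots = sorted(mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot))
    lower = boots[int(0.025 * n_boot)]
    upper = boots[int(0.975 * n_boot) - 1]
    return mean(diffs), (lower, upper)


# Placeholder prompts: repeated filler words standing in for real variants.
triplets = [
    {"benign": "w " * 10, "dual_use": "w " * 11, "malicious": "w " * (10 + extra)}
    for extra in (4, 5, 6, 5, 4, 6)
]

gap, (lower, upper) = paired_length_gap(triplets)
print(f"mean gap: {gap} words, 95% interval: ({lower}, {upper})")
```

An interval that excludes zero, as in this toy data, would indicate that the malicious variants are systematically longer, which is the confound described above. Length is only one proxy; specificity and required knowledge would need separate checks, most likely with human annotation.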

This paper is for researchers building or using safety benchmarks who want to move past single-prompt testing. A reader focused on evaluation methodology will find the design direction useful even if the current execution needs tightening.

It deserves peer review because the benchmark concept addresses a real gap in how safe completion is measured, and the reported patterns are worth checking once the variant construction is documented and validated.

Referee Report

2 major / 1 minor

Summary. The paper introduces OpenSafeIntent, a benchmark consisting of controlled prompt-sets with benign, dual-use, and malicious variants of the same underlying task. This design is used to evaluate whether language models calibrate safe completion across intent shifts rather than exhibiting safety only on average. The authors report that models frequently fail to remain safe across matched variants, that dual-use behavior is brittle under paraphrase, that high-level answers on risky topics are not reliably safe, and that reframing responses are less likely to cross safety boundaries. They conclude that safe completion should be assessed as intent-calibrated behavior over controlled task variants instead of a single safety-helpfulness tradeoff over independent prompts.

Significance. If the task-equivalence assumption holds, the benchmark offers a more precise method for diagnosing safety calibration failures that standard aggregate evaluations miss. The controlled variant design and evaluation across a broad model suite are strengths that could influence future safety benchmarking practices.

major comments (2)

[Benchmark construction] (described in the abstract and implied methods): the claim that each datapoint's variants hold the underlying task fixed (same difficulty, content, required knowledge) while differing only in expressed intent is load-bearing for attributing observed refusal/reframing differences to intent calibration. No construction protocol, human validation metrics, or statistical equivalence checks are supplied, so task variation cannot be ruled out as a confound.
[Results and discussion] The reported failures (brittle dual-use, unsafe high-level answers) are interpreted as evidence of poor intent calibration, but this interpretation rests directly on unverified equivalence; without independent confirmation that variants are matched on difficulty and content, the central recommendation for intent-calibrated evaluation cannot be cleanly supported.

minor comments (1)

The abstract states the design goal clearly but would benefit from one concrete example of a matched prompt triplet to illustrate how intent is varied while task content is held fixed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and for highlighting the centrality of the task-equivalence assumption. We address each major comment below and will revise the manuscript to strengthen the supporting evidence.

Referee: [Benchmark construction] (described in the abstract and implied methods): the claim that each datapoint's variants hold the underlying task fixed (same difficulty, content, required knowledge) while differing only in expressed intent is load-bearing for attributing observed refusal/reframing differences to intent calibration. No construction protocol, human validation metrics, or statistical equivalence checks are supplied, so task variation cannot be ruled out as a confound.

Authors: We agree that the manuscript currently lacks an explicit, self-contained construction protocol and associated validation metrics. In the revised version we will add a dedicated subsection in Methods that details the prompt-generation procedure, the operational criteria used to hold task difficulty, content, and required knowledge constant across intent variants, the human validation protocol (including annotator instructions and agreement statistics), and any quantitative equivalence checks performed on the final dataset. Revision: yes.

Referee: [Results and discussion] The reported failures (brittle dual-use, unsafe high-level answers) are interpreted as evidence of poor intent calibration, but this interpretation rests directly on unverified equivalence; without independent confirmation that variants are matched on difficulty and content, the central recommendation for intent-calibrated evaluation cannot be cleanly supported.

Authors: This observation follows directly from the first comment. Once the expanded construction and validation details are included, we will revise the Results and Discussion sections to tie each reported failure explicitly to the verified equivalence evidence. We will also add a short limitations paragraph that acknowledges the assumption and any residual uncertainty that the new checks cannot eliminate. Revision: yes.

Circularity Check

0 steps flagged

Empirical benchmark introduction with no derivations or self-referential reductions

The paper introduces OpenSafeIntent as an empirical benchmark consisting of controlled prompt variants. It contains no equations, fitted parameters, predictions, or derivation chains. The central design choice (holding task fixed while varying intent) is presented as an explicit methodological assumption rather than derived from prior results or self-citations. No load-bearing self-citation, ansatz smuggling, or renaming of known results occurs. The reported findings are observational comparisons across models on the new dataset and do not reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities identifiable.

pith-pipeline@v0.9.1-grok · 5691 in / 984 out tokens · 25724 ms · 2026-07-03T14:42:47.465567+00:00 · methodology

