Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Gang Xu; Hongjie Jiang; Mengxuan Wang; Ming Li; Tao He; Yuxin Chen

arxiv: 2602.03402 · v3 · submitted 2026-02-03 · 💻 cs.AI · cs.LG

Mengxuan Wang, Yuxin Chen, Gang Xu, Tao He, Hongjie Jiang, Ming Li

Pith reviewed 2026-05-16 08:10 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords risk awareness injection · vision language models · multimodal jailbreak · safety calibration · training free · unsafe prototype subspace · token modulation

The pith

Risk Awareness Injection reduces multimodal jailbreak success in vision-language models by modulating high-risk visual tokens to restore safety recognition without training or performance loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models often lose the risk detection ability that large language models have when images are added, making them susceptible to combined image-text attacks. Risk Awareness Injection addresses this by building an unsafe prototype subspace from text embeddings and then adjusting specific visual tokens that carry high risk signals. This targeted change activates safety awareness in the cross-modal space while leaving other token meanings untouched. The result is lower attack success on jailbreak tests and unchanged accuracy on standard tasks, all without any model retraining. A sympathetic reader would see this as a practical way to add safety to existing multimodal systems.

Core claim

The paper establishes that constructing an Unsafe Prototype Subspace from language embeddings and performing targeted modulation on selected high-risk visual tokens restores the model's LLM-like ability to detect unsafe content from visual inputs, substantially reducing attack success rates across jailbreak benchmarks while preserving task performance on utility evaluations.

What carries the argument

Unsafe Prototype Subspace from language embeddings with targeted modulation on high-risk visual tokens to amplify safety-critical signals in the cross-modal feature space.
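A minimal sketch of this mechanism, assuming the update rule quoted in the Lean-theorem section of this page, h'_v = h_v + sum_k s_{v,k} · u_k / ||u_k||. The page does not define s_{v,k} or the selection rule precisely, so the sketch takes s_{v,k} to be the cosine similarity between visual token h_v and prototype u_k, and treats a token as high-risk when its largest score exceeds a threshold. The names `modulate_tokens` and `tau`, and the toy vectors, are hypothetical and are not taken from the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot(a, b) / (na * nb)

def modulate_tokens(visual_tokens, prototypes, tau=0.5):
    """Apply h'_v = h_v + sum_k s_{v,k} * u_k / ||u_k|| to selected tokens.

    ASSUMPTIONS (not confirmed by the source): s_{v,k} is cosine similarity,
    and a token is selected when its maximum score exceeds tau.
    `prototypes` must be a non-empty list of vectors.
    Returns (new_tokens, selected_indices).
    """
    new_tokens, selected = [], []
    for idx, h in enumerate(visual_tokens):
        scores = [cosine(h, u) for u in prototypes]
        if max(scores) <= tau:
            # Low-risk token: copied through unchanged.
            new_tokens.append(list(h))
            continue
        selected.append(idx)
        updated = list(h)
        for s, u in zip(scores, prototypes):
            n = norm(u)
            if n == 0.0:
                continue  # skip degenerate prototypes
            for j in range(len(updated)):
                updated[j] += s * u[j] / n
        new_tokens.append(updated)
    return new_tokens, selected

# Toy example: one unsafe prototype, one risky token, one benign token.
prototypes = [[1.0, 0.0, 0.0]]
tokens = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
modulated, picked = modulate_tokens(tokens, prototypes, tau=0.5)
```

In this toy case only the first token is shifted, and only along the prototype direction; the second token is returned as-is, which is the behaviour the preservation claim depends on.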

Load-bearing premise

Visual inputs dilute risk-related signals in VLMs, and targeted modulation of high-risk visual tokens can activate safety signals without damaging the original semantic meaning needed for reasoning.

What would settle it

A test case where applying the token modulation fails to reduce the jailbreak success rate on a benchmark, or where it causes a measurable drop in accuracy on a utility benchmark such as visual question answering.
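The two failure conditions can be written as a simple check. This is an illustrative sketch only: the function names, the `utility_tolerance` default, and the toy outcome lists are hypothetical, and a real evaluation would also need a significance test rather than a raw comparison.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts judged successful (True = jailbreak worked)."""
    if not outcomes:
        raise ValueError("need at least one attack outcome")
    return sum(1 for o in outcomes if o) / len(outcomes)

def accuracy(correct_flags):
    """Fraction of utility questions answered correctly."""
    if not correct_flags:
        raise ValueError("need at least one utility result")
    return sum(1 for c in correct_flags if c) / len(correct_flags)

def claim_is_contradicted(asr_before, asr_after, acc_before, acc_after,
                          utility_tolerance=0.01):
    """True if either failure condition holds.

    utility_tolerance is an assumed cut-off for what counts as a
    'measurable' accuracy drop; the source does not specify one.
    """
    no_safety_gain = asr_after >= asr_before
    utility_drop = (acc_before - acc_after) > utility_tolerance
    return no_safety_gain or utility_drop

# Toy data (made up for illustration, not results from the paper).
base_attacks = [True, True, True, False]
rai_attacks = [True, False, False, False]
base_correct = [True] * 9 + [False]
rai_correct = [True] * 9 + [False]

contradicted = claim_is_contradicted(
    attack_success_rate(base_attacks), attack_success_rate(rai_attacks),
    accuracy(base_correct), accuracy(rai_correct),
)
```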

Original abstract

Vision language models (VLMs) extend the reasoning capabilities of large language models (LLMs) to cross-modal settings, yet remain highly vulnerable to multimodal jailbreak attacks. Existing defenses predominantly rely on safety fine-tuning or aggressive token manipulations, incurring substantial training costs or significantly degrading utility. Recent research shows that LLMs inherently recognize unsafe content in text, and the incorporation of visual inputs in VLMs frequently dilutes risk-related signals. Motivated by this, we propose Risk Awareness Injection (RAI), a lightweight and training-free framework for safety calibration that restores LLM-like risk recognition by amplifying unsafe signals in VLMs. Specifically, RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens, explicitly activating safety-critical signals within the cross-modal feature space. This modulation restores the model's LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks demonstrate that RAI substantially reduces attack success rate without compromising task performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RAI gives a training-free safety tweak for VLMs by building an unsafe subspace from text embeddings and modulating high-risk visual tokens, but the claim that this leaves utility untouched rests on an unproven separation of directions.

The main thing here is a training-free method called Risk Awareness Injection that tries to restore LLM-style risk detection in VLMs. They build an Unsafe Prototype Subspace from language embeddings and then adjust selected visual tokens to strengthen safety signals without full retraining. The paper shows this cuts attack success rates on jailbreak benchmarks while holding steady on utility tasks, which is the practical angle they emphasize.

What stands out is how they avoid the usual costs of safety fine-tuning or blunt token suppression. The dilution observation is reasonable, and targeting the cross-modal feature space directly is a clean way to address it. The construction looks reproducible from the description, and the focus on high-risk tokens rather than all of them is a sensible efficiency choice.

The soft spot is exactly the one the stress test flags. The modulation has to add a risk direction without bleeding into features that encode object identity, attributes, or spatial layout, otherwise utility drops on clean inputs even if the abstract says semantics stay intact. The paper asserts preservation, but I would want to see the actual ablations that measure representation shift on utility-critical tokens and confirm the subspace direction has near-zero projection onto those axes. Without those controls, the no-compromise claim is hard to trust at face value.

This is for groups working on deployable multimodal safety who want something lighter than retraining. A reader who needs quick calibration options would get concrete value from the method and the benchmark numbers if they check out. It deserves peer review because the core idea is distinct enough and the problem is real, even if the utility evidence will need tightening.

Referee Report

2 major / 1 minor

Summary. The paper proposes Risk Awareness Injection (RAI), a training-free method to calibrate vision-language models (VLMs) against multimodal jailbreaks. It constructs an Unsafe Prototype Subspace from language embeddings and applies targeted modulation to selected high-risk visual tokens to restore LLM-like risk recognition, claiming substantial reduction in attack success rates while preserving utility on downstream tasks.

Significance. If the central claim holds, RAI would offer a lightweight, parameter-free alternative to safety fine-tuning for VLMs, addressing a practical deployment barrier without retraining costs. The training-free construction and explicit focus on cross-modal signal dilution are strengths that could influence safety calibration research.

major comments (2)

[Abstract] Abstract: The claim that modulation on high-risk visual tokens 'preserves the semantic integrity of original tokens for cross-modal reasoning' is load-bearing for the no-utility-compromise result, yet the description provides no ablation, cosine-similarity analysis, or representation-shift metric demonstrating that the modulation vector is approximately orthogonal to features used by utility heads (object identity, spatial relations, attribute binding).
[Method] Method (Unsafe Prototype Subspace construction): The selection criterion for high-risk visual tokens and the projection onto the language-derived subspace are not shown to guarantee independence from utility-critical directions; without this, degradation on safe inputs remains possible even if average benchmark scores appear stable.
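The diagnostics requested in these comments could look roughly like the following. This is a sketch under stated assumptions: `representation_shift` and `utility_overlap` are hypothetical helper names, and the "utility directions" are stand-in vectors, since the page does not say how such directions would be obtained from a real model.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot(a, b) / (na * nb)

def representation_shift(original, modulated):
    """1 - cosine similarity between a token before and after modulation."""
    return 1.0 - cosine(original, modulated)

def utility_overlap(original, modulated, utility_directions):
    """Largest |cosine| between the modulation delta and any utility direction.

    A value near 0 means the change is close to orthogonal to every
    supplied direction; a value near 1 means it moves along one of them.
    """
    delta = [m - o for o, m in zip(original, modulated)]
    return max(abs(cosine(delta, d)) for d in utility_directions)

# Toy vectors (hypothetical): the modulation only moves the first coordinate,
# while the assumed utility directions live in the other two.
original = [0.0, 1.0, 0.0]
modulated = [0.5, 1.0, 0.0]
utility_directions = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

shift = representation_shift(original, modulated)
overlap = utility_overlap(original, modulated, utility_directions)
```

Reporting both numbers per token matters because they answer different questions: the shift says how far a token moved, while the overlap says whether it moved along a direction that downstream reasoning is assumed to use.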

minor comments (1)

[Abstract] Abstract: No quantitative metrics, baselines, or dataset sizes are reported, which hinders immediate assessment of effect sizes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below with clarifications drawn from the manuscript and commit to targeted revisions to strengthen the presentation.

Referee: [Abstract] Abstract: The claim that modulation on high-risk visual tokens 'preserves the semantic integrity of original tokens for cross-modal reasoning' is load-bearing for the no-utility-compromise result, yet the description provides no ablation, cosine-similarity analysis, or representation-shift metric demonstrating that the modulation vector is approximately orthogonal to features used by utility heads (object identity, spatial relations, attribute binding).

Authors: We acknowledge that the manuscript would benefit from explicit quantitative support for the orthogonality claim. Our utility benchmarks (VQA, captioning, and reasoning tasks) already demonstrate no measurable degradation, which is consistent with the modulation primarily affecting risk-aligned directions. In the revision we will add cosine-similarity measurements between the modulation vector and utility-critical feature directions, together with representation-shift metrics and an ablation that isolates the effect on object-identity and spatial-relation heads.
revision: yes

Referee: [Method] Method (Unsafe Prototype Subspace construction): The selection criterion for high-risk visual tokens and the projection onto the language-derived subspace are not shown to guarantee independence from utility-critical directions; without this, degradation on safe inputs remains possible even if average benchmark scores appear stable.

Authors: The subspace is constructed solely from language embeddings of unsafe textual concepts, and a visual token is selected for modulation only when its projection onto this subspace exceeds a threshold derived from unsafe prototypes. This cross-modal alignment targets risk signals that are diluted by vision. While the current experiments show stable performance on safe inputs, we agree that explicit verification of independence is valuable. The revision will include (i) per-token projection statistics on safe versus unsafe images and (ii) an analysis confirming that utility-critical directions remain largely unaffected.
revision: yes
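The selection rule and the promised per-token projection statistics might be computed along these lines. This is a hedged sketch: the subspace is taken to be the span of the prototype vectors, the basis is built with Gram-Schmidt, and the threshold `tau` is a free illustrative parameter, because the response does not say how the threshold is derived from the prototypes. All names and toy vectors are hypothetical.

```python
import math
from statistics import mean

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        n = norm(w)
        if n > eps:  # drop vectors already inside the current span
            basis.append([wi / n for wi in w])
    return basis

def projection_ratio(token, basis):
    """Share of the token's norm that lies inside the subspace (0 to 1)."""
    n = norm(token)
    if n == 0.0:
        return 0.0
    coeffs = [dot(token, b) for b in basis]
    return math.sqrt(sum(c * c for c in coeffs)) / n

def selection_stats(tokens, basis, tau):
    """Mean projection ratio and fraction of tokens with ratio above tau."""
    ratios = [projection_ratio(t, basis) for t in tokens]
    return {
        "mean_ratio": mean(ratios),
        "selected_fraction": sum(1 for r in ratios if r > tau) / len(ratios),
    }

# Toy data (made up): prototypes span the first two coordinates.
basis = orthonormal_basis([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
safe_tokens = [[0.0, 0.0, 1.0], [0.1, 0.0, 1.0]]
unsafe_tokens = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.2]]

safe_stats = selection_stats(safe_tokens, basis, tau=0.5)
unsafe_stats = selection_stats(unsafe_tokens, basis, tau=0.5)
```

Comparing `selected_fraction` on safe versus unsafe images is the direct check on the referee's concern: a non-trivial fraction on safe inputs would mean benign tokens are being modulated.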

Circularity Check

0 steps flagged

No circularity: RAI is an explicit construction validated by external benchmarks

The paper motivates RAI from the observed dilution of risk signals when visual inputs are added to LLMs, then directly defines the Unsafe Prototype Subspace from language embeddings and the targeted modulation rule on high-risk visual tokens. This is a training-free algorithmic construction, not a fitted parameter renamed as a prediction. No equations reduce the claimed safety-utility tradeoff to the inputs by definition. No self-citation is invoked as a uniqueness theorem or load-bearing premise. The abstract and method description present the modulation as an explicit design choice whose effect on semantics is asserted and then tested on separate jailbreak and utility benchmarks. Because the central claim rests on this construction plus empirical measurement rather than on any tautological reduction or self-referential loop, the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that visual inputs dilute risk signals and introduces the Unsafe Prototype Subspace as a new entity without independent evidence or falsifiable handles provided in the abstract.

axioms (1)

domain assumption: Incorporation of visual inputs in VLMs frequently dilutes risk-related signals
This is explicitly stated as the key motivation drawn from recent research in the abstract.

invented entities (1)

Unsafe Prototype Subspace: no independent evidence
purpose: Constructed from language embeddings to enable targeted modulation that activates safety-critical signals in visual tokens
Newly introduced concept central to the RAI framework for restoring risk recognition.

pith-pipeline@v0.9.0 · 5496 in / 1347 out tokens · 44087 ms · 2026-05-16T08:10:08.073034+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens... $h'_v = h_v + \sum_k s_{v,k} \, \frac{u_k}{\lVert u_k \rVert}$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Risk Signal Dilution... semantic gap in visual-text alignment

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.