
arxiv: 2602.03402 · v3 · submitted 2026-02-03 · 💻 cs.AI · cs.LG

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Pith reviewed 2026-05-16 08:10 UTC · model grok-4.3

keywords: risk awareness injection · vision language models · multimodal jailbreak · safety calibration · training free · unsafe prototype subspace · token modulation

The pith

Risk Awareness Injection reduces multimodal jailbreak success in vision-language models by modulating high-risk visual tokens to restore safety recognition without training or performance loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When images are added to the input, vision-language models often lose the risk-detection ability of the underlying large language model, which leaves them susceptible to combined image-text attacks. Risk Awareness Injection addresses this by building an Unsafe Prototype Subspace from text embeddings and then adjusting the specific visual tokens that carry high-risk signals. This targeted change is meant to activate safety awareness in the cross-modal space while leaving the tokens' original semantics intact. The reported result is a lower attack success rate on jailbreak benchmarks with no loss of performance on utility tasks, all without any model retraining. A sympathetic reader would see this as a practical way to add safety to existing multimodal systems.

Core claim

The paper claims that constructing an Unsafe Prototype Subspace from language embeddings and performing targeted modulation on selected high-risk visual tokens restores the model's LLM-like ability to detect unsafe content from visual inputs, substantially reducing attack success rates across jailbreak benchmarks while preserving task performance on utility evaluations.

What carries the argument

Unsafe Prototype Subspace from language embeddings with targeted modulation on high-risk visual tokens to amplify safety-critical signals in the cross-modal feature space.
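
The mechanism named above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the construction or the modulation rule, so every detail here is an assumption made for illustration. The subspace is taken to be the span of a few unsafe text embeddings (orthonormalized with Gram-Schmidt), the risk score is the fraction of a token's norm that lies in that span, and modulation adds a scaled copy of the in-subspace component. The names `orthonormal_basis`, `risk_score`, `modulate`, `threshold`, and `alpha` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, eps=1e-9):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def project(token, basis):
    """Orthogonal projection of `token` onto the span of `basis`."""
    out = [0.0] * len(token)
    for b in basis:
        c = dot(token, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out

def risk_score(token, basis):
    """Fraction of the token's norm inside the subspace (0 to 1). Assumed scoring rule."""
    norm = math.sqrt(dot(token, token))
    p = project(token, basis)
    return math.sqrt(dot(p, p)) / norm if norm > 0 else 0.0

def modulate(tokens, basis, threshold=0.5, alpha=1.0):
    """Amplify the in-subspace component of tokens scoring above `threshold`."""
    out = []
    for t in tokens:
        if risk_score(t, basis) > threshold:
            p = project(t, basis)
            out.append([ti + alpha * pi for ti, pi in zip(t, p)])
        else:
            out.append(list(t))  # low-risk tokens pass through unchanged
    return out

# Toy 3-d data (placeholders, not real embeddings): the "unsafe" span is the x-axis.
unsafe_text_embeddings = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
basis = orthonormal_basis(unsafe_text_embeddings)
visual_tokens = [[3.0, 0.0, 4.0], [0.0, 1.0, 0.0], [4.0, 3.0, 0.0]]
modulated = modulate(visual_tokens, basis, threshold=0.7, alpha=1.0)
```

In this toy setup only the third token exceeds the threshold, and its component orthogonal to the subspace is left as it was. Whether real utility-relevant features sit outside the subspace in this way is exactly the premise the referee report questions.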

Load-bearing premise

Visual inputs dilute risk-related signals in VLMs, and targeted modulation of high-risk visual tokens can activate safety signals without damaging the original semantic meaning needed for reasoning.

What would settle it

A test case where applying the token modulation fails to reduce the jailbreak success rate on a benchmark, or where it causes a measurable drop in accuracy on a utility benchmark such as visual question answering.
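
The two falsifying conditions can be written down as a small check. This is a sketch under stated assumptions: the source does not say what counts as a "measurable" drop, so `utility_tol` is a hypothetical tolerance, and the outcome lists are toy placeholders rather than results from the paper.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts judged successful (True = jailbreak succeeded)."""
    return sum(outcomes) / len(outcomes)

def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def settles_against(asr_base, asr_rai, acc_base, acc_rai, utility_tol=0.01):
    """True if either falsifying condition is observed.

    utility_tol is an assumed tolerance for what counts as a measurable drop.
    """
    no_safety_gain = asr_rai >= asr_base
    utility_drop = (acc_base - acc_rai) > utility_tol
    return no_safety_gain or utility_drop

# Toy placeholder outcomes, not data from the paper.
base_attacks = [True, True, False, True]
rai_attacks = [False, True, False, False]
labels = ["cat", "dog", "car", "tree"]
base_preds = ["cat", "dog", "car", "bush"]
rai_preds = ["cat", "dog", "car", "bush"]

verdict = settles_against(
    attack_success_rate(base_attacks),
    attack_success_rate(rai_attacks),
    accuracy(base_preds, labels),
    accuracy(rai_preds, labels),
)
```

A real evaluation would also need enough samples to tell a genuine drop from noise; this sketch only compares point estimates.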

Original abstract

Vision language models (VLMs) extend the reasoning capabilities of large language models (LLMs) to cross-modal settings, yet remain highly vulnerable to multimodal jailbreak attacks. Existing defenses predominantly rely on safety fine-tuning or aggressive token manipulations, incurring substantial training costs or significantly degrading utility. Recent research shows that LLMs inherently recognize unsafe content in text, and the incorporation of visual inputs in VLMs frequently dilutes risk-related signals. Motivated by this, we propose Risk Awareness Injection (RAI), a lightweight and training-free framework for safety calibration that restores LLM-like risk recognition by amplifying unsafe signals in VLMs. Specifically, RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens, explicitly activating safety-critical signals within the cross-modal feature space. This modulation restores the model's LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks demonstrate that RAI substantially reduces attack success rate without compromising task performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Risk Awareness Injection (RAI), a training-free method to calibrate vision-language models (VLMs) against multimodal jailbreaks. It constructs an Unsafe Prototype Subspace from language embeddings and applies targeted modulation to selected high-risk visual tokens to restore LLM-like risk recognition, claiming substantial reduction in attack success rates while preserving utility on downstream tasks.

Significance. If the central claim holds, RAI would offer a lightweight, parameter-free alternative to safety fine-tuning for VLMs, addressing a practical deployment barrier without retraining costs. The training-free construction and explicit focus on cross-modal signal dilution are strengths that could influence safety calibration research.

major comments (2)
  1. [Abstract] The claim that modulation on high-risk visual tokens 'preserves the semantic integrity of original tokens for cross-modal reasoning' is load-bearing for the no-utility-compromise result, yet the description provides no ablation, cosine-similarity analysis, or representation-shift metric demonstrating that the modulation vector is approximately orthogonal to features used by utility heads (object identity, spatial relations, attribute binding).
  2. [Method] Unsafe Prototype Subspace construction: The selection criterion for high-risk visual tokens and the projection onto the language-derived subspace are not shown to guarantee independence from utility-critical directions; without this, degradation on safe inputs remains possible even if average benchmark scores appear stable.
minor comments (1)
  1. [Abstract] No quantitative metrics, baselines, or dataset sizes are reported, which hinders immediate assessment of effect sizes.
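
The two diagnostics requested in the first major comment can be sketched as follows. This is an illustration only: the modulation vector, the utility directions, and their names (`object_identity`, `spatial_relation`) are hypothetical placeholders, and the paper may define representation shift differently from the mean relative L2 change used here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)

def mean_relative_shift(original, modulated):
    """Mean over tokens of ||t' - t|| / ||t|| (assumed shift metric)."""
    shifts = []
    for t, m in zip(original, modulated):
        diff = math.sqrt(sum((mi - ti) ** 2 for ti, mi in zip(t, m)))
        norm = math.sqrt(sum(ti * ti for ti in t))
        shifts.append(diff / norm if norm > 0 else 0.0)
    return sum(shifts) / len(shifts)

# Toy placeholders: one modulation direction and two hypothetical utility directions.
modulation_vector = [1.0, 0.0, 0.0]
utility_directions = {
    "object_identity": [0.0, 1.0, 0.0],
    "spatial_relation": [1.0, 0.0, 1.0],
}
overlaps = {
    name: cosine_similarity(modulation_vector, direction)
    for name, direction in utility_directions.items()
}

original_tokens = [[4.0, 3.0, 0.0], [0.0, 1.0, 0.0]]
modulated_tokens = [[8.0, 3.0, 0.0], [0.0, 1.0, 0.0]]
shift = mean_relative_shift(original_tokens, modulated_tokens)
```

An overlap near zero would be consistent with the orthogonality the referee asks about, while a large overlap (as with the toy `spatial_relation` direction) would flag a direction where modulation could interfere with utility.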

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below with clarifications drawn from the manuscript and commit to targeted revisions to strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract] The claim that modulation on high-risk visual tokens 'preserves the semantic integrity of original tokens for cross-modal reasoning' is load-bearing for the no-utility-compromise result, yet the description provides no ablation, cosine-similarity analysis, or representation-shift metric demonstrating that the modulation vector is approximately orthogonal to features used by utility heads (object identity, spatial relations, attribute binding).

    Authors: We acknowledge that the manuscript would benefit from explicit quantitative support for the orthogonality claim. Our utility benchmarks (VQA, captioning, and reasoning tasks) already demonstrate no measurable degradation, which is consistent with the modulation primarily affecting risk-aligned directions. In the revision we will add cosine-similarity measurements between the modulation vector and utility-critical feature directions, together with representation-shift metrics and an ablation that isolates the effect on object-identity and spatial-relation heads. revision: yes

  2. Referee: [Method] Unsafe Prototype Subspace construction: The selection criterion for high-risk visual tokens and the projection onto the language-derived subspace are not shown to guarantee independence from utility-critical directions; without this, degradation on safe inputs remains possible even if average benchmark scores appear stable.

    Authors: The subspace is constructed solely from language embeddings of unsafe textual concepts, and a visual token is selected for modulation only when its projection onto this subspace exceeds a threshold derived from unsafe prototypes. This cross-modal alignment targets risk signals that are diluted by vision. While the current experiments show stable performance on safe inputs, we agree that explicit verification of independence is valuable. The revision will include (i) per-token projection statistics on safe versus unsafe images and (ii) an analysis confirming that utility-critical directions remain largely unaffected. revision: yes
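
The per-token projection statistics promised in item (i) could take roughly the following shape. This is a sketch under assumptions: the scores are toy placeholders, the threshold is passed in as a plain parameter because the source does not say how it is derived from the unsafe prototypes, and `projection_stats` is a hypothetical helper name.

```python
from statistics import mean

def projection_stats(scores, threshold):
    """Summarize per-token projection scores for one group of images."""
    selected = [s for s in scores if s > threshold]
    return {
        "mean": mean(scores),
        "max": max(scores),
        "selected_fraction": len(selected) / len(scores),
    }

# Toy placeholder scores (not measurements): projection magnitude per visual token.
safe_scores = [0.05, 0.10, 0.15, 0.10]
unsafe_scores = [0.20, 0.70, 0.90, 0.60]
threshold = 0.5  # assumed value, for illustration only

safe_summary = projection_stats(safe_scores, threshold)
unsafe_summary = projection_stats(unsafe_scores, threshold)
```

A `selected_fraction` near zero on safe images would indicate that modulation rarely fires on benign inputs, which is the independence property the referee asks the authors to demonstrate.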

Circularity Check

0 steps flagged

No circularity: RAI is an explicit construction validated by external benchmarks

Full rationale

The paper motivates RAI from the observed dilution of risk signals when visual inputs are added to LLMs, then directly defines the Unsafe Prototype Subspace from language embeddings and the targeted modulation rule on high-risk visual tokens. This is a training-free algorithmic construction, not a fitted parameter renamed as a prediction. No equations reduce the claimed safety-utility tradeoff to the inputs by definition. No self-citation is invoked as a uniqueness theorem or load-bearing premise. The abstract and method description present the modulation as an explicit design choice whose effect on semantics is asserted and then tested on separate jailbreak and utility benchmarks. Because the central claim rests on this construction plus empirical measurement rather than on any tautological reduction or self-referential loop, the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that visual inputs dilute risk signals and introduces the Unsafe Prototype Subspace as a new entity without independent evidence or falsifiable handles provided in the abstract.

axioms (1)
  • domain assumption: Incorporation of visual inputs in VLMs frequently dilutes risk-related signals
    This is explicitly stated as the key motivation drawn from recent research in the abstract.
invented entities (1)
  • Unsafe Prototype Subspace (no independent evidence)
    purpose: Constructed from language embeddings to enable targeted modulation that activates safety-critical signals in visual tokens
    Newly introduced concept central to the RAI framework for restoring risk recognition.

pith-pipeline@v0.9.0 · 5496 in / 1347 out tokens · 44087 ms · 2026-05-16T08:10:08.073034+00:00 · methodology

