Mirrorcheck: Efficient adversarial defense for vision-language models

· 2024 · cs.CV · arXiv 2406.09250

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies specifically designed to bypass existing defenses. To address this vulnerability, we propose MirrorCheck, a robust and model-agnostic detection framework that operates effectively in both unimodal and multimodal settings. MirrorCheck leverages Text-to-Image (T2I) models to regenerate visual content from captions produced by the target model and assesses semantic consistency by comparing feature-space embeddings between the original and synthesized images. To enhance robustness against adaptive attacks, MirrorCheck introduces a stochastic defense strategy that randomly selects T2I generators and image encoders from a diverse model zoo. Additionally, we incorporate a novel One-Time-Use (OTU) perturbation applied to the selected encoder embeddings, regulated by a scaling factor, which decreases the effectiveness of adaptive attacks. Extensive experiments across multiple threat scenarios demonstrate that MirrorCheck consistently outperforms baseline methods, and maintains its utility even under strong adaptive adversarial conditions.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics

cs.CR · 2026-06-05 · unverdicted · novelty 6.0

MTK detects jailbreaks by monitoring the evolution of prompt neighborhood structures on the data manifold through LLM layers, reporting 95% TPR at 5% FPR on benign and 2% on pseudo-malicious prompts plus 85% TPR under adaptive attacks.

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

cs.CR · 2025-02-02 · unverdicted · novelty 2.0

A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.

citing papers explorer

Showing 2 of 2 citing papers.

Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics cs.CR · 2026-06-05 · unverdicted · none · ref 74 · internal anchor
MTK detects jailbreaks by monitoring the evolution of prompt neighborhood structures on the data manifold through LLM layers, reporting 95% TPR at 5% FPR on benign and 2% on pseudo-malicious prompts plus 85% TPR under adaptive attacks.
Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety cs.CR · 2025-02-02 · unverdicted · none · ref 247 · internal anchor
A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.

Mirrorcheck: Efficient adversarial defense for vision-language models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer