Safety fine-tuning at (almost) no cost: A baseline for vision large language models

Zong, Y · 2024 · arXiv 2402.02207

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.

When Vision Speaks for Sound

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Video MLLMs show an audio-visual Clever Hans effect relying on visual-acoustic correlations rather than audio verification; Thud interventions diagnose it and a 10K-sample preference alignment improves intervention performance by 28 points.

SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.

VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models

cs.CR · 2026-05-02 · conditional · novelty 6.0

Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

cs.CL · 2024-04-22 · accept · novelty 6.0

Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

New Wide-Net-Casting Jailbreak Attacks Risk Large Models

cs.CR · 2026-05-16 · unverdicted · novelty 5.0

The paper demonstrates that a tailored jailbreak method for querying groups of large models can achieve up to 100% success rate in some experiments on unprotected models, revealing overlooked multi-model safety risks.

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

cs.CL · 2025-03-03 · unverdicted · novelty 5.0

Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and tops speech recognition leaderboards.

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

cs.CR · 2024-09-26 · unverdicted · novelty 2.0

Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.

citing papers explorer

Showing 2 of 2 citing papers after filters.

SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

cs.AI · 2026-05-12 · unverdicted · none · ref 4

SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

cs.CR · 2024-09-26 · unverdicted · none · ref 183

Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.
