Safety fine-tuning at (almost) no cost: A baseline for vision large language models

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models , author= · 2024 · arXiv 2402.02207

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

read on arXiv browse 12 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Unveiling Privacy Risks in Multi-modal Large Language Models: Task-specific Vulnerabilities and Mitigation Challenges

cs.CR · 2026-06-08 · unverdicted · novelty 6.0

Introduces MM-Privacy dataset and evaluations showing MLLMs leak sensitive data from images in various tasks, highlighting task inconsistency effects.

Defending Against Malicious Finetuning by Scaling Train-time Adversarial Attacks

cs.CL · 2026-06-06 · unverdicted · novelty 6.0

Patcher improves LLM robustness to malicious full-parameter finetuning by scaling train-time adversarial attacks in a bi-level optimization loop and supplies an efficient parallel implementation.

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.

Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.

When Vision Speaks for Sound

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Video MLLMs show an audio-visual Clever Hans effect relying on visual-acoustic correlations rather than audio verification; Thud interventions diagnose it and a 10K-sample preference alignment improves intervention performance by 28 points.

SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.

VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models

cs.CR · 2026-05-02 · conditional · novelty 6.0

Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

cs.CL · 2024-04-22 · accept · novelty 6.0

Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

Constitutional On-Policy Safe Distillation

cs.LG · 2026-06-02 · unverdicted · novelty 5.0

COPSD uses a Cross-SFT cold-start followed by constitution-conditioned distillation to achieve stronger safety-helpfulness balance and lower safety tax on reasoning than prior on-policy self-distillation methods.

New Wide-Net-Casting Jailbreak Attacks Risk Large Models

cs.CR · 2026-05-16 · unverdicted · novelty 5.0

The paper demonstrates that a tailored jailbreak method for querying groups of large models can achieve up to 100% success rate in some experiments on unprotected models, revealing overlooked multi-model safety risks.

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

cs.CL · 2025-03-03 · unverdicted · novelty 5.0

Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

cs.CR · 2024-09-26 · unverdicted · novelty 2.0

Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.

citing papers explorer

Showing 0 of 0 citing papers after filters.

No citing papers match the current filters.

Safety fine-tuning at (almost) no cost: A baseline for vision large language models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer