Introduces MM-Privacy dataset and evaluations showing MLLMs leak sensitive data from images in various tasks, highlighting task inconsistency effects.
Safety fine-tuning at (almost) no cost: A baseline for vision large language models
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Patcher improves LLM robustness to malicious full-parameter finetuning by scaling train-time adversarial attacks in a bi-level optimization loop and supplies an efficient parallel implementation.
SafeSteer restricts reverse KL penalty to safety tokens selected via activation steering, achieving strong safety on seven benchmarks with minimal degradation on five capability benchmarks using only 100 harmful samples and no general data.
Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.
Video MLLMs show an audio-visual Clever Hans effect relying on visual-acoustic correlations rather than audio verification; Thud interventions diagnose it and a 10K-sample preference alignment improves intervention performance by 28 points.
SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.
Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.
Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
COPSD uses a Cross-SFT cold-start followed by constitution-conditioned distillation to achieve stronger safety-helpfulness balance and lower safety tax on reasoning than prior on-policy self-distillation methods.
The paper demonstrates that a tailored jailbreak method for querying groups of large models can achieve up to 100% success rate in some experiments on unprotected models, revealing overlooked multi-model safety risks.
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.
citing papers explorer
No citing papers match the current filters.