Learning transferable visual models from natural language supervision
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CV (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Cross-Modal Attention Analysis and Optimization in Vision-Language Models: A Study on Visual Reliability
An optimized LoRA fine-tuned CLIP model cuts accuracy degradation from 27.5% to 9.8% under text-image conflicting adversarial tests on a geometric shapes dataset while retaining 97% normal accuracy.