InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36:49250–49267
4 Pith papers cite this work.
fields: cs.CV (4)
years: 2026 (4)
citing papers explorer
- ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation
  ReAlign distills LLM-generated reasoning texts into a lightweight AIGI forgery detector via contrastive image-text alignment to improve generalization on complex forgeries.
- Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding
  Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent gains on multimodal benchmarks.
- G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval
  G-MIXER achieves state-of-the-art zero-shot composed image retrieval by using geodesic mixup to build diverse implicit candidates and MLLM-derived explicit semantics for re-ranking.
- Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models