Sparse autoencoders inserted into VLMs and trained only for reconstruction can reliably detect adversarial attacks on images, including unseen domains and attack types.
Title resolution pending
7 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 7representative citing papers
PGT generates synthetic tasks via geometric overlays on images to supply dense visual supervision, improving spatial and relational understanding in MLLMs by up to 20% on targeted benchmarks.
PA-BDM adapts block diffusion by switching to causal intra-block denoising and dynamically committing reliable prefixes to KV cache, yielding higher accuracy and 71.6% higher throughput than a comparable baseline on document benchmarks.
ALLaVA creates 1.3M GPT4V-synthesized samples enabling 4B VLMs to achieve competitive results on 17 benchmarks and match 7B/13B models on some tasks.
A Nash equilibrium framework for training-free multimodal step verification that uses cross-modal agreement and disagreement signals for filtering and ranking reasoning steps.
PRISM interleaves VLM perception and LLM reasoning via a dynamic goal-oriented question-answer pipeline to produce sharper scene descriptions, outperforming prior image-based models on ALFWorld and Room-to-Room.
Proposes unifying pix and word tokens in generative LMs via per-pixel embeddings, color folding, and unsupervised pretraining, claiming strong small-model performance on limited data.
citing papers explorer
-
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models
ALLaVA creates 1.3M GPT4V-synthesized samples enabling 4B VLMs to achieve competitive results on 17 benchmarks and match 7B/13B models on some tasks.