arxiv: 2505.23161 · v2 · pith:FRYSE7P3new · submitted 2025-05-29 · 💻 cs.CV · cs.AI · cs.LG

Implicit Inversion turns CLIP into a Decoder

classification 💻 cs.CV cs.AI cs.LG
keywords clip, image, decoder, align, discriminative, generation, generative, images

CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. In this work, we show that image synthesis is nevertheless possible using CLIP alone -- without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
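
The abstract names an Orthogonal Procrustes projection but does not spell out how it is applied. As a rough illustration only, the textbook closed-form solution is sketched below with NumPy. The function name, the synthetic arrays `text_emb` and `image_emb`, and their shapes are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def orthogonal_procrustes(src, dst):
    """Return the orthogonal matrix R that minimizes ||src @ R - dst||_F."""
    # Cross-covariance between the two embedding sets, shape (d, d).
    m = src.T @ dst
    # The SVD of the cross-covariance gives the closed-form optimum R = U V^T.
    u, _, vt = np.linalg.svd(m)
    return u @ vt


# Synthetic stand-ins for paired embeddings (hypothetical data, not CLIP outputs).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(32, 8))
true_rot, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # a random orthogonal map
image_emb = text_emb @ true_rot

r = orthogonal_procrustes(text_emb, image_emb)
aligned = text_emb @ r  # text embeddings mapped onto the image side
```

Because `r` is constrained to be orthogonal, the mapping preserves norms and angles, which is presumably why such a projection counts as lightweight. SciPy offers an equivalent routine, `scipy.linalg.orthogonal_procrustes`, if a library call is preferred.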


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TRANSPORTER: Transferring Visual Semantics from VLM Manifolds

    cs.CV · 2025-11 · unverdicted · novelty 7.0

    TRANSPORTER generates videos from VLM logits using optimal transport to interpret model predictions on object attributes, actions, and scenes.