A vision-language model is finetuned on 114k anonymized relational captions to embed images by their underlying structural correspondences instead of visible attributes.
Imagenet: A large-scale hierarchical image database
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CV 2verdicts
UNVERDICTED 2representative citing papers
By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.
citing papers explorer
-
Relational Visual Similarity
A vision-language model is finetuned on 114k anonymized relational captions to embed images by their underlying structural correspondences instead of visible attributes.
-
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.