Yo'LLaVA: Your Personalized Language and Vision Assistant
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers
- Relational Visual Similarity
  A vision-language model is fine-tuned on 114k anonymized relational captions to embed images by their underlying structural correspondences instead of visible attributes.
- PersonaVLM: Long-Term Personalized Multimodal LLMs
  PersonaVLM adds memory extraction, multi-turn retrieval-based reasoning, and personality inference to multimodal LLMs, yielding 22.4% gains on a new long-term personalization benchmark and outperforming GPT-4o.