Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.
Language models can see: Plugging visual controls in text generation
4 Pith papers cite this work.
citation-role summary: baseline 1, method 1
representative citing papers
A single model trained only on image-text pairs gains instruction-following ability across images, video, and audio by routing all modalities through ImageBind's shared embedding space into Vicuna.
A holistic survey of affective computing for intelligent agents covering emotion understanding via multimodal data, affective cognition, emotional expression synthesis, key challenges, and future directions emphasizing generative technologies.