Language models can see: Plugging visual controls in text generation

2022 · arXiv 2205.02655

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

baseline: 1 · method: 1

citation-polarity summary

baseline: 1 · use method: 1

representative citing papers

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

cs.CV · 2022-04-01 · unverdicted · novelty 7.0

Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

cs.CV · 2026-02-02 · unverdicted · novelty 6.0

ReAlign corrects the modality gap in unpaired data to let MLLMs learn visual distributions from text alone before instruction tuning, reducing dependence on expensive paired corpora.

PandaGPT: One Model To Instruction-Follow Them All

cs.CL · 2023-05-25 · conditional · novelty 6.0

A single model trained only on image-text pairs gains instruction-following ability across images, video, and audio by routing all modalities through ImageBind's shared embedding space into Vicuna.

Intelligent Agents with Emotional Intelligence: Current Trends, Challenges, and Future Prospects

cs.HC · 2025-10-11 · unverdicted · novelty 2.0

A holistic survey of affective computing for intelligent agents covering emotion understanding via multimodal data, affective cognition, emotional expression synthesis, key challenges, and future directions emphasizing generative technologies.

citing papers explorer

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
cs.CV · 2022-04-01 · unverdicted · none · ref 61

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
cs.CV · 2026-02-02 · unverdicted · none · ref 22

PandaGPT: One Model To Instruction-Follow Them All
cs.CL · 2023-05-25 · conditional · none · ref 25

Intelligent Agents with Emotional Intelligence: Current Trends, Challenges, and Future Prospects
cs.HC · 2025-10-11 · unverdicted · none · ref 55
