hub Canonical reference

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker · 2022 · cs.CV · arXiv 2204.00598

Canonical reference. 92% of citing Pith papers cite this work as background.

26 Pith papers citing it

Background 92% of classified citations

open full Pith review browse 26 citing papers arXiv PDF

abstract

Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 12

citation-polarity summary

background 11 unclear 1

representative citing papers

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

cs.CL · 2023-04-14 · conditional · novelty 8.0

API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.

Code as Policies: Language Model Programs for Embodied Control

cs.RO · 2022-09-16 · accept · novelty 8.0

Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections

cs.CV · 2025-11-16 · conditional · novelty 7.0

BridgeEQA creates a new benchmark and EMVR method for embodied agents to perform question answering on real-world bridge inspections using egocentric images and professional reports.

From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems

cs.MA · 2025-06-05 · accept · novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

cs.CV · 2023-10-23 · unverdicted · novelty 7.0

HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

cs.RO · 2023-07-12 · unverdicted · novelty 7.0

VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

Voyager: An Open-Ended Embodied Agent with Large Language Models

cs.AI · 2023-05-25 · unverdicted · novelty 7.0

Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing技能

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

cs.CV · 2023-03-08 · accept · novelty 7.0

Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

STiTch refines LLM captions via embedding transition and uses set-to-set bidirectional transportation alignment to improve training-free zero-shot composed image retrieval.

Building a Precise Video Language with Human-AI Oversight

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.

Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

Perception Programs rewrite dense visual tool outputs into language-native summaries, boosting MLLM accuracy by 15-45% absolute on BLINK perception tasks and setting new state-of-the-art results.

Demystifying CLIP Data

cs.CV · 2023-09-28 · accept · novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

Cognitive Architectures for Language Agents

cs.AI · 2023-09-05 · accept · novelty 6.0

CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.

Improving Factuality and Reasoning in Language Models through Multiagent Debate

cs.CL · 2023-05-23 · unverdicted · novelty 6.0

Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

cs.CV · 2023-03-20 · unverdicted · novelty 6.0

MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents

cs.AI · 2023-02-03 · conditional · novelty 6.0

DEPS combines LLM-based interactive planning with a trainable goal selector to create a zero-shot multi-task agent that completes 70+ Minecraft tasks and nearly doubles prior performance.

Inner Monologue: Embodied Reasoning through Planning with Language Models

cs.RO · 2022-07-12 · unverdicted · novelty 6.0

LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.

Emergent Abilities of Large Language Models

cs.CL · 2022-06-15 · unverdicted · novelty 6.0

Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

CoCa: Contrastive Captioners are Image-Text Foundation Models

cs.CV · 2022-05-04 · accept · novelty 6.0

CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.

From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

cs.CV · 2026-05-04 · unverdicted · novelty 5.0

SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

LLM-Guided Task- and Affordance-Level Exploration in Reinforcement Learning

cs.RO · 2025-09-20 · unverdicted · novelty 5.0

LLM-TALE steers RL exploration using LLM-generated plans at task and affordance levels with online suboptimality correction, improving sample efficiency and success rates on pick-and-place tasks without human supervision.

LIVE-GS: LLM Powers Interactive VR Experience with Physics-Aware Gaussian Splatting

cs.HC · 2024-12-12 · unverdicted · novelty 5.0

LIVE-GS uses an LLM to predict physical parameters from static Gaussian assets in 10 seconds for physics-aware VR interactions, validated by interviews, baseline comparisons, and user studies.

citing papers explorer

Showing 26 of 26 citing papers.

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs cs.CL · 2023-04-14 · conditional · none · ref 20 · internal anchor
API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.
Code as Policies: Language Model Programs for Embodied Control cs.RO · 2022-09-16 · accept · none · ref 16 · internal anchor
Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.
BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections cs.CV · 2025-11-16 · conditional · none · ref 47 · internal anchor
BridgeEQA creates a new benchmark and EMVR method for embodied agents to perform question answering on real-world bridge inspections using egocentric images and professional reports.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems cs.MA · 2025-06-05 · accept · none · ref 223 · internal anchor
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 71 · internal anchor
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models cs.CV · 2023-10-23 · unverdicted · none · ref 55 · internal anchor
HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models cs.RO · 2023-07-12 · unverdicted · none · ref 11 · internal anchor
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
Voyager: An Open-Ended Embodied Agent with Large Language Models cs.AI · 2023-05-25 · unverdicted · none · ref 24 · internal anchor
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing技能
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models cs.CV · 2023-03-08 · accept · none · ref 50 · internal anchor
Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.
Flamingo: a Visual Language Model for Few-Shot Learning cs.CV · 2022-04-29 · unverdicted · none · ref 145 · internal anchor
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval cs.CV · 2026-05-20 · unverdicted · none · ref 46 · internal anchor
STiTch refines LLM captions via embedding transition and uses set-to-set bidirectional transportation alignment to improve training-free zero-shot composed image retrieval.
Building a Precise Video Language with Human-AI Oversight cs.CV · 2026-04-22 · unverdicted · none · ref 84 · internal anchor
CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.
Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs cs.CV · 2026-04-14 · unverdicted · none · ref 31 · internal anchor
Perception Programs rewrite dense visual tool outputs into language-native summaries, boosting MLLM accuracy by 15-45% absolute on BLINK perception tasks and setting new state-of-the-art results.
Demystifying CLIP Data cs.CV · 2023-09-28 · accept · none · ref 45 · internal anchor
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
Cognitive Architectures for Language Agents cs.AI · 2023-09-05 · accept · none · ref 93 · internal anchor
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
Improving Factuality and Reasoning in Language Models through Multiagent Debate cs.CL · 2023-05-23 · unverdicted · none · ref 32 · internal anchor
Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action cs.CV · 2023-03-20 · unverdicted · none · ref 36 · internal anchor
MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents cs.AI · 2023-02-03 · conditional · none · ref 48 · internal anchor
DEPS combines LLM-based interactive planning with a trainable goal selector to create a zero-shot multi-task agent that completes 70+ Minecraft tasks and nearly doubles prior performance.
Inner Monologue: Embodied Reasoning through Planning with Language Models cs.RO · 2022-07-12 · unverdicted · none · ref 19 · internal anchor
LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.
Emergent Abilities of Large Language Models cs.CL · 2022-06-15 · unverdicted · none · ref 99 · internal anchor
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
CoCa: Contrastive Captioners are Image-Text Foundation Models cs.CV · 2022-05-04 · accept · none · ref 70 · internal anchor
CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs cs.CV · 2026-05-04 · unverdicted · none · ref 84 · internal anchor
SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.
LLM-Guided Task- and Affordance-Level Exploration in Reinforcement Learning cs.RO · 2025-09-20 · unverdicted · none · ref 18 · internal anchor
LLM-TALE steers RL exploration using LLM-generated plans at task and affordance levels with online suboptimality correction, improving sample efficiency and success rates on pick-and-place tasks without human supervision.
LIVE-GS: LLM Powers Interactive VR Experience with Physics-Aware Gaussian Splatting cs.HC · 2024-12-12 · unverdicted · none · ref 59 · internal anchor
LIVE-GS uses an LLM to predict physical parameters from static Gaussian assets in 10 seconds for physics-aware VR interactions, validated by interviews, baseline comparisons, and user studies.
GIT: A Generative Image-to-text Transformer for Vision and Language cs.CV · 2022-05-27 · unverdicted · none · ref 35 · internal anchor
GIT achieves new state-of-the-art results on 12 vision-language benchmarks, including surpassing human performance on TextCaps, via a simplified single-encoder single-decoder transformer scaled on large pre-training data.
A Survey on Multimodal Large Language Models cs.CV · 2023-06-23 · accept · none · ref 205 · internal anchor
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer