ViperGPT: Visual Inference via Python Execution for Reasoning

2023 · cs.CV · arXiv 2303.08128

Canonical reference. 100% of citing Pith papers cite this work as background.

23 Pith papers citing it

Background 100% of classified citations

abstract

Answering visual queries is a complex task that requires both visual processing and reasoning. End-to-end models, the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and generalization. Learning modular programs presents a promising alternative, but has proven challenging due to the difficulty of learning both the programs and modules simultaneously. We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines to produce a result for any query. ViperGPT utilizes a provided API to access the available modules, and composes them by generating Python code that is later executed. This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.
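
The generate-then-execute loop described in the abstract can be illustrated with a toy sketch. Everything named here is a hypothetical stand-in, not ViperGPT's actual API: `ToyPatch` replaces real vision modules with a label lookup, and `fake_code_generator` returns a fixed program where a code-generation model would be called. The sketch only shows the control flow: a query becomes Python source that calls the provided API, and that source is then executed against the image.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyPatch:
    """Hypothetical image region; a real module would wrap a vision model."""
    objects: List[str]

    def find(self, name: str) -> List["ToyPatch"]:
        # Stub detector: one patch per object whose label matches `name`.
        return [ToyPatch([o]) for o in self.objects if o == name]


def fake_code_generator(query: str) -> str:
    """Stand-in for the code-generation model (assumption: fixed output)."""
    return (
        "def execute_command(image):\n"
        "    muffins = image.find('muffin')\n"
        "    kids = image.find('kid')\n"
        "    return len(muffins) // len(kids)\n"
    )


def run_query(query: str, image: ToyPatch) -> object:
    program = fake_code_generator(query)
    namespace: dict = {}
    # Define execute_command from the generated source, then call it.
    exec(program, namespace)
    return namespace["execute_command"](image)


image = ToyPatch(["muffin"] * 8 + ["kid"] * 2)
answer = run_query("How many muffins can each kid have?", image)
print(answer)  # 4
```

Because the generated program is ordinary Python, each intermediate value (`muffins`, `kids`) can be inspected, which is the kind of interpretability the abstract contrasts with end-to-end models. Running model-generated source through `exec` is unsafe outside a sandbox, so a real deployment would need isolation that this sketch omits.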

citation-role summary

background 7

citation-polarity summary

background 7

representative citing papers

PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications

cs.DC · 2026-05-18 · unverdicted · novelty 7.0

PopPy combines an ahead-of-time compiler and runtime to extract parallelism from Python compound AI applications, delivering up to 6.4x end-to-end speedups while preserving sequential semantics.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

cs.RO · 2023-07-12 · unverdicted · novelty 7.0

VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

Visual Instruction Tuning

cs.CV · 2023-04-17 · unverdicted · novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

cs.CV · 2023-03-28 · conditional · novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

cs.CV · 2026-05-05 · unverdicted · novelty 6.0

HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.

Time Series Augmented Generation for Financial Applications

cs.AI · 2026-04-21 · unverdicted · novelty 6.0

TSAG lets LLMs use external tools for financial time series analysis, with a new benchmark showing capable agents achieve near-perfect tool accuracy and minimal hallucination.

A Domain-Specific Language for LLM-Driven Trigger Generation in Multimodal Data Collection

cs.DB · 2026-03-13 · unverdicted · novelty 6.0

A DSL combined with LLMs generates consistent, low-latency triggers for selective multimodal sensor data collection, outperforming direct code generation in consistency and speed with comparable detection performance.

Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

cs.CV · 2025-12-11 · unverdicted · novelty 6.0

Visual Funnel resolves contextual blindness in MLLMs by constructing an entropy-scaled portfolio of hierarchically structured image crops that preserves both local detail and global context.

PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models

cs.CV · 2025-12-01 · conditional · novelty 6.0

A new dataset and fine-tuned VLM detector/explainer called PhyDetEx shows that current T2V models still struggle to generate videos that obey physical laws, with open-source models performing worse.

Grounded Reinforcement Learning for Visual Reasoning

cs.CV · 2025-05-29 · unverdicted · novelty 6.0

ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.

What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

cs.CV · 2024-07-11 · unverdicted · novelty 6.0

Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

cs.CV · 2023-11-16 · unverdicted · novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

A Survey on Large Language Model based Autonomous Agents

cs.AI · 2023-08-22 · accept · novelty 6.0

A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

cs.CV · 2023-04-20 · conditional · novelty 6.0

MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, creative writing, and instruction following.

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

cs.CV · 2023-03-20 · unverdicted · novelty 6.0

MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

MORN: Metacognitive Object-Goal Regulation for Resource-Rational Long-Horizon Navigation

cs.RO · 2026-05-16 · unverdicted · novelty 5.0

MORN augments frozen VLM-based object navigation agents with a System 2 meta-controller using Potentiality Index, Persistence Gating, and Evidence Accumulation to improve goal completion rate from 0.23 to 0.30 and reduce wasted steps on the HM3D dataset.

MIRAGE: A Micro-Interaction Relational Architecture for Grounded Exploration in Multi-Figure Artworks

cs.CV · 2026-04-26 · unverdicted · novelty 5.0

MIRAGE improves VLM analysis of multi-figure art by inserting a verifiable structured representation of micro-interactions between spatial grounding and narrative output.

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

cs.CV · 2023-12-21 · unverdicted · novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

cs.CV · 2023-04-28 · conditional · novelty 5.0

LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.

Chat Modeling: Interaction-Enhanced Agent Framework for Visualizing Literature-Grounded Biological Structures

cs.HC · 2024-04-01 · unverdicted · novelty 4.0

Chat Modeling is a multi-agent LLM framework with modeling memory and dynamic chat widgets that translates text inputs into interactive 3D modeling operations for literature-grounded biological structures.

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

cs.CV · 2023-09-29 · conditional · novelty 4.0

GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.

A Comprehensive Overview of Large Language Models

cs.CL · 2023-07-12 · unverdicted · novelty 2.0

A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
