
arxiv: 2303.08128 · v1 · pith:EJIVFRXUnew · submitted 2023-03-14 · 💻 cs.CV

ViperGPT: Visual Inference via Python Execution for Reasoning

Pith reviewed 2026-05-17 18:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual reasoning · code generation · vision language models · programmatic composition · visual question answering · modular reasoning · python execution · zero-shot inference

The pith

ViperGPT uses code generation to create Python programs that combine pre-trained vision models to answer complex visual queries without further training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual queries require both seeing and reasoning, but most end-to-end models blend the two, which limits interpretability and generalization. ViperGPT instead has a language model write Python code that calls separate vision tools as needed, then runs that code to produce the answer. This requires no extra training on the target tasks, and the paper reports state-of-the-art results on several challenging visual reasoning benchmarks. The explicit programs make each step inspectable and let the system handle new combinations of queries.

Core claim

The central discovery is that composing vision-and-language models via generated Python code executed at inference time solves complex visual tasks at state-of-the-art levels without any task-specific training.

What carries the argument

A code-generation model that writes and executes Python programs using a fixed API to available vision and language modules.
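
As a rough illustration of this mechanism, the sketch below wires a stubbed code generator to a toy module API and executes the resulting program. Everything in it is hypothetical: `ToyImage`, `find`, `fake_codegen`, and `run_query` are placeholders invented for this example and are not taken from the paper. A real system would prompt a code-generation model and call real vision modules.

```python
from dataclasses import dataclass, field


@dataclass
class ToyImage:
    """Hypothetical stand-in for an image: holds object labels instead of pixels."""
    objects: list = field(default_factory=list)

    def find(self, name: str) -> list:
        # A real module would run an object detector; here we just filter labels.
        return [obj for obj in self.objects if obj == name]


def fake_codegen(query: str) -> str:
    """Hypothetical stand-in for the code-generation model.

    Returns a fixed program; a real system would prompt an LLM with the
    API specification and the query.
    """
    return (
        "def execute_command(image):\n"
        "    muffins = image.find('muffin')\n"
        "    kids = image.find('kid')\n"
        "    return len(muffins) // max(len(kids), 1)\n"
    )


def run_query(query: str, image: ToyImage):
    """Generate a program for the query, then execute it on the image."""
    source = fake_codegen(query)
    namespace: dict = {}
    exec(source, namespace)  # unsandboxed; untrusted code should be isolated in practice
    return namespace["execute_command"](image)


image = ToyImage(objects=["muffin"] * 8 + ["kid"] * 2)
answer = run_query("How many muffins can each kid have?", image)
print(answer)
```

The point of the design is that the reasoning (counting and dividing) lives in ordinary Python, while perception is delegated to module calls such as `find`.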

Load-bearing premise

A language model can consistently generate correct Python code that properly uses the vision modules for any given query.

What would settle it

Running the system on a benchmark where many generated programs contain syntax errors or logical mistakes that lead to wrong answers.
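
One way such a test could be instrumented is to label every generated program by its failure mode and tally the labels. The sketch below is an assumption-laden illustration, not the paper's evaluation code: the entry-point name `execute_command`, the four labels, and the sample programs are all made up for this example.

```python
from collections import Counter


def classify_program(source: str, image, expected) -> str:
    """Label one generated program by how it fails, if it fails."""
    try:
        code = compile(source, "<generated>", "exec")
    except SyntaxError:
        return "syntax_error"
    namespace: dict = {}
    try:
        exec(code, namespace)
        result = namespace["execute_command"](image)
    except Exception:
        # Covers missing entry points, unknown methods, type errors, and so on.
        return "runtime_error"
    return "correct" if result == expected else "wrong_answer"


# Made-up programs for illustration; the "image" is just a list of labels.
programs = [
    "def execute_command(image):\n    return len(image)\n",
    "def execute_command(image)\n    return 0\n",                       # missing colon
    "def execute_command(image):\n    return image.count_objects()\n",  # no such method
    "def execute_command(image):\n    return len(image) + 1\n",         # runs, wrong result
]
image = ["cat", "dog", "dog"]
tally = Counter(classify_program(p, image, expected=3) for p in programs)
print(tally)
```

A benchmark on which `syntax_error`, `runtime_error`, or `wrong_answer` dominate the tally would count against the load-bearing premise.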

original abstract

Answering visual queries is a complex task that requires both visual processing and reasoning. End-to-end models, the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and generalization. Learning modular programs presents a promising alternative, but has proven challenging due to the difficulty of learning both the programs and modules simultaneously. We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines to produce a result for any query. ViperGPT utilizes a provided API to access the available modules, and composes them by generating Python code that is later executed. This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ViperGPT, a framework that leverages code-generation models (e.g., Codex or GPT) to synthesize Python programs composing pre-trained vision-and-language modules via a provided API, enabling visual reasoning on complex queries without any additional training or fine-tuning. It claims this yields state-of-the-art results across visual tasks such as VQA.

Significance. If the empirical results prove robust, the work demonstrates that LLM-driven program synthesis can produce interpretable, modular visual reasoning systems that avoid end-to-end training, offering potential gains in generalization, error tracing, and reuse of existing vision modules.

major comments (2)
  1. [Experiments] Experiments section: The manuscript reports state-of-the-art results on datasets including GQA and OK-VQA yet provides no aggregate statistics on code-generation success rate, retry frequency, or error types (logic errors, API misuse, execution failures) across the full test sets. This is load-bearing for the central claim that the 'simple approach' reliably achieves SOTA without training, because performance may reflect only the subset of queries where the LLM produces correct executable code.
  2. [Section 3] Section 3: The few-shot prompting procedure for generating code is described, but the text does not quantify or bound the reliability of the generated programs for arbitrary queries, nor does it detail how failed generations are filtered or retried before reporting final accuracy numbers.
minor comments (2)
  1. [Abstract] The abstract asserts 'state-of-the-art results across various complex visual tasks' without naming the specific datasets or reporting the magnitude of improvement over baselines.
  2. [Section 3] Notation for the vision modules and API calls could be made more consistent between the method description and the example programs shown in figures.
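
Major comment 1 turns on the gap between accuracy over the full test set and accuracy over only those queries whose generated program ran. A minimal sketch of that distinction follows; the `Outcome` record and the ten sample outcomes are invented for illustration and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    executed: bool  # did the generated program run to completion?
    correct: bool   # did the final answer match the ground truth?


def accuracy_report(outcomes: list[Outcome]) -> dict[str, float]:
    """Compare full-set accuracy with accuracy on the executed subset."""
    if not outcomes:
        raise ValueError("no outcomes to report on")
    total = len(outcomes)
    executed = [o for o in outcomes if o.executed]
    # A program that failed to run cannot count as correct.
    n_correct = sum(o.correct for o in executed)
    return {
        "execution_rate": len(executed) / total,
        "full_set_accuracy": n_correct / total,
        "executed_subset_accuracy": n_correct / len(executed) if executed else 0.0,
    }


# Synthetic data: 10 queries, 8 programs ran, 6 of those were correct.
outcomes = (
    [Outcome(executed=True, correct=True)] * 6
    + [Outcome(executed=True, correct=False)] * 2
    + [Outcome(executed=False, correct=False)] * 2
)
report = accuracy_report(outcomes)
print(report)
```

With this synthetic data the executed-subset accuracy (0.75) overstates the full-set accuracy (0.6), which is the kind of discrepancy the referee asks the authors to rule out.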

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and indicate planned revisions to strengthen the manuscript.

point-by-point responses
  1. Referee: [Experiments] Experiments section: The manuscript reports state-of-the-art results on datasets including GQA and OK-VQA yet provides no aggregate statistics on code-generation success rate, retry frequency, or error types (logic errors, API misuse, execution failures) across the full test sets. This is load-bearing for the central claim that the 'simple approach' reliably achieves SOTA without training, because performance may reflect only the subset of queries where the LLM produces correct executable code.

    Authors: We agree that aggregate statistics on code-generation success would strengthen the central claim. In the revised manuscript we will add a new analysis subsection to the Experiments section reporting overall code-generation success rate, retry counts, and error-type breakdown (syntax, API misuse, execution, logic) across the full GQA and OK-VQA test sets. These numbers will show that the reported accuracies reflect the complete test distributions after transparent retry handling rather than a cherry-picked subset. revision: yes

  2. Referee: [Section 3] Section 3: The few-shot prompting procedure for generating code is described, but the text does not quantify or bound the reliability of the generated programs for arbitrary queries, nor does it detail how failed generations are filtered or retried before reporting final accuracy numbers.

    Authors: We will expand Section 3 with empirical quantification of code-generation reliability drawn from our validation experiments (success rates on held-out queries) and a clear description of the retry and filtering procedure (re-prompting with error feedback or fallback to a default program). We note, however, that a general theoretical bound on reliability for arbitrary queries lies outside the scope of this empirical study and would require assumptions about the underlying LLM that we do not claim. revision: partial
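
The retry procedure named in the second response (re-prompting with error feedback, then falling back to a default) could look roughly like the sketch below. This illustrates the simulated rebuttal's wording only. Nothing in the source confirms that the paper implements retries this way, and `generate_with_retry`, `flaky_codegen`, and `run_program` are hypothetical names.

```python
from typing import Callable, Optional


def generate_with_retry(
    query: str,
    codegen: Callable[[str, Optional[str]], str],
    run: Callable[[str], object],
    fallback: Callable[[], object],
    max_attempts: int = 3,
):
    """Return (answer, attempts_used, used_fallback)."""
    error_feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        source = codegen(query, error_feedback)
        try:
            return run(source), attempt, False
        except Exception as exc:
            # Pass the error text to the next generation attempt.
            error_feedback = f"{type(exc).__name__}: {exc}"
    return fallback(), max_attempts, True


def flaky_codegen(query: str, feedback: Optional[str]) -> str:
    """Stub generator: broken on the first try, fixed once it sees feedback."""
    if feedback is None:
        return "def execute_command():\n    return undefined_name\n"
    return "def execute_command():\n    return 'yes'\n"


def run_program(source: str):
    namespace: dict = {}
    exec(source, namespace)  # unsandboxed; isolate untrusted code in practice
    return namespace["execute_command"]()


result = generate_with_retry("Is there a dog?", flaky_codegen, run_program, lambda: "unknown")
print(result)
```

Returning the attempt count and the fallback flag alongside the answer makes the retry frequency the referee asks about straightforward to aggregate.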

standing simulated objections not resolved
  • A theoretical bound on reliability for arbitrary queries

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on benchmark results

full rationale

The paper introduces ViperGPT as a training-free framework that uses an off-the-shelf code-generation model to compose provided vision modules via Python execution. Its central claims (no further training required, SOTA on visual reasoning tasks) are supported by experimental evaluation on standard benchmarks rather than by any derivation, equation, or first-principles prediction. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described method; the approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework assumes the existence of a reliable code-generation model and a fixed set of vision modules exposed through an API. No free parameters are fitted inside the method itself; the only external dependencies are the pre-trained modules and the code model.

axioms (2)
  • domain assumption A code-generation model can produce Python programs that correctly invoke the provided vision modules for the target queries.
    Invoked implicitly when the paper states that the generated code is executed to produce results.
  • domain assumption The supplied API exposes a sufficient set of vision-and-language modules to solve the evaluated tasks.
    Required for the composition approach to be viable; stated via the phrase 'utilizes a provided API to access the available modules'.
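
Both axioms can be probed statically before any program runs, by checking whether generated code calls only methods the API actually exposes. The sketch below uses Python's standard `ast` module. The method names in `ALLOWED_METHODS` and the receiver name `image` are assumptions made for illustration and are not a statement of the paper's API.

```python
import ast

# Hypothetical API surface; a real check would derive this from the API spec.
ALLOWED_METHODS = {"find", "exists", "simple_query"}


def unknown_method_calls(source: str, receiver: str = "image") -> set[str]:
    """Return method names called on `receiver` that are not in ALLOWED_METHODS."""
    unknown = set()
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == receiver
            and node.func.attr not in ALLOWED_METHODS
        ):
            unknown.add(node.func.attr)
    return unknown


program = (
    "def execute_command(image):\n"
    "    boxes = image.find('dog')\n"
    "    return image.estimate_depth(boxes)\n"
)
missing = unknown_method_calls(program)
print(missing)
```

This only inspects direct calls on a variable with the given name, so it is a coarse filter. Calls made on values returned by other methods would need type information to check.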




Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications

    cs.DC 2026-05 unverdicted novelty 7.0

    PopPy combines an ahead-of-time compiler and runtime to extract parallelism from Python compound AI applications, delivering up to 6.4x end-to-end speedups while preserving sequential semantics.

  2. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  3. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    cs.RO 2023-07 unverdicted novelty 7.0

    VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

  4. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  5. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  6. Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.

  7. Time Series Augmented Generation for Financial Applications

    cs.AI 2026-04 unverdicted novelty 6.0

    TSAG lets LLMs use external tools for financial time series analysis, with a new benchmark showing capable agents achieve near-perfect tool accuracy and minimal hallucination.

  8. A Domain-Specific Language for LLM-Driven Trigger Generation in Multimodal Data Collection

    cs.DB 2026-03 unverdicted novelty 6.0

    A DSL combined with LLMs generates consistent, low-latency triggers for selective multimodal sensor data collection, outperforming direct code generation in consistency and speed with comparable detection performance.

  9. Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

    cs.CV 2025-12 unverdicted novelty 6.0

    Visual Funnel resolves contextual blindness in MLLMs by constructing an entropy-scaled portfolio of hierarchically structured image crops that preserves both local detail and global context.

  10. PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models

    cs.CV 2025-12 conditional novelty 6.0

    A new dataset and fine-tuned VLM detector/explainer called PhyDetEx shows that current T2V models still struggle to generate videos that obey physical laws, with open-source models performing worse.

  11. Grounded Reinforcement Learning for Visual Reasoning

    cs.CV 2025-05 unverdicted novelty 6.0

    ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.

  12. What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

    cs.CV 2024-07 unverdicted novelty 6.0

    Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.

  13. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  14. A Survey on Large Language Model based Autonomous Agents

    cs.AI 2023-08 accept novelty 6.0

    A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

  15. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    cs.CV 2023-04 conditional novelty 6.0

    MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...

  16. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    cs.CV 2023-03 unverdicted novelty 6.0

    MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

  17. MORN: Metacognitive Object-Goal Regulation for Resource-Rational Long-Horizon Navigation

    cs.RO 2026-05 unverdicted novelty 5.0

    MORN augments frozen VLM-based object navigation agents with a System 2 meta-controller using Potentiality Index, Persistence Gating, and Evidence Accumulation to improve goal completion rate from 0.23 to 0.30 and red...

  18. MIRAGE: A Micro-Interaction Relational Architecture for Grounded Exploration in Multi-Figure Artworks

    cs.CV 2026-04 unverdicted novelty 5.0

    MIRAGE improves VLM analysis of multi-figure art by inserting a verifiable structured representation of micro-interactions between spatial grounding and narrative output.

  19. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  20. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    cs.CV 2023-04 conditional novelty 5.0

    LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.

  21. Chat Modeling: Interaction-Enhanced Agent Framework for Visualizing Literature-Grounded Biological Structures

    cs.HC 2024-04 unverdicted novelty 4.0

    Chat Modeling is a multi-agent LLM framework with modeling memory and dynamic chat widgets that translates text inputs into interactive 3D modeling operations for literature-grounded biological structures.

  22. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

    cs.CV 2023-09 conditional novelty 4.0

    GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.

  23. A Comprehensive Overview of Large Language Models

    cs.CL 2023-07 unverdicted novelty 2.0

    A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 23 Pith papers · 12 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Se- bastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sa- hand Sharifzadeh, Mikolaj ...

  2. [2]

    Neural module networks

    Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), June 2016

  3. [3]

    Systematic Generalization: What Is Required and Can It Be Learned?

    Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch, Thien Huu Nguyen, Harm de Vries, and Aaron Courville. Systematic Generalization: What Is Required and Can It Be Learned?, Apr. 2019. arXiv:1811.12889 [cs]

  4. [4]

    arXiv:1709.08568 [cs.LG].https: //arxiv.org/abs/1709.08568

    Yoshua Bengio. The Consciousness Prior, Dec. 2019. arXiv:1709.08568 [cs, stat]

  5. [5]

    Bravo, Sudhanshu Mittal, Simon Ging, and Thomas Brox

    Maria A. Bravo, Sudhanshu Mittal, Simon Ging, and Thomas Brox. Open-vocabulary attribute detection. arXiv preprint arXiv:2211.12914, 2022

  6. [6]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, Sand- hini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz...

  7. [7]

    Revisiting the" video" in video-language understanding

    Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, and Juan Carlos Niebles. Revisiting the" video" in video-language understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2917–2927, 2022

  8. [8]

    In- terpretable Visual Question Answering by Reasoning on De- pendency Trees

    Qingxing Cao, Xiaodan Liang, Bailin Li, and Liang Lin. In- terpretable Visual Question Answering by Reasoning on De- pendency Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(3):887–901, Mar. 2021

  9. [9]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Hen- rique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ry- der, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Moham- mad Bavarian, Clemens...

  10. [10]

    Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling com- putation from reasoning for numerical reasoning tasks.arXiv preprint arXiv:2211.12588, 2022

  11. [11]

    Visual Grounding via Accumulated At- tention

    Chaorui Deng, Qi Wu, Qingyao Wu, Fuyuan Hu, Fan Lyu, and Mingkui Tan. Visual Grounding via Accumulated At- tention

  12. [12]

    Dijkstra

    E.W. Dijkstra. Information streams sharing a finite buffer. Information Processing Letters, 1(5):179–180, 1972

  13. [13]

    Transform-retrieve- generate: Natural language-centric outside-knowledge vi- sual question answering

    Feng Gao, Qing Ping, Govind Thattai, Aishwarya Reganti, Ying Nian Wu, and Prem Natarajan. Transform-retrieve- generate: Natural language-centric outside-knowledge vi- sual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 5067–5077, 2022

  14. [14]

    PAL: Program-aided Language Models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022

  15. [15]

    Cofar: Commonsense and factual reasoning in image search

    Prajwal Gatti, Abhirama Subramanyam Penamakuri, Revant Teotia, Anand Mishra, Shubhashis Sengupta, and Roshni Ramnani. Cofar: Commonsense and factual reasoning in image search. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Proc...

  16. [16]

    KAT: A knowledge augmented transformer for vision-and-language

    Liangke Gui, Borui Wang, Qiuyuan Huang, Alexander Hauptmann, Yonatan Bisk, and Jianfeng Gao. KAT: A knowledge augmented transformer for vision-and-language. In Proceedings of the 2022 Conference of the North Ameri- can Chapter of the Association for Computational Linguis- tics: Human Language Technologies, pages 956–968, Seat- tle, United States, July 202...

  17. [17]

    Visual programming: Compositional visual reason- ing without training

    Tanmay Gupta and Aniruddha Kembhavi. Visual pro- gramming: Compositional visual reasoning without training. arXiv preprint arXiv:2211.11559, 2022

  18. [18]

    In- terpretable visual reasoning: A survey

    Feijuan He, Yaxian Wang, Xianglin Miao, and Xia Sun. In- terpretable visual reasoning: A survey. Image and Vision Computing, 112:104194, 2021

  19. [19]

    Learning to Reason: End-to- End Module Networks for Visual Question Answering.2017 IEEE International Conference on Computer Vision (ICCV), pages 804–813, Oct

    Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to Reason: End-to- End Module Networks for Visual Question Answering.2017 IEEE International Conference on Computer Vision (ICCV), pages 804–813, Oct. 2017. Conference Name: 2017 IEEE International Conference on Computer Vision (ICCV) ISBN: 9781538610329 Place: Venice P...

  20. [20]

    Language-conditioned graph networks for relational reasoning

    Ronghang Hu, Anna Rohrbach, Trevor Darrell, and Kate Saenko. Language-conditioned graph networks for relational reasoning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10294–10303, 2019

  21. [21]

    A., and Luo, J

    Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A Smith, and Jiebo Luo. Promptcap: Prompt-guided task- aware image captioning. arXiv preprint arXiv:2211.09699, 2022

  22. [22]

    Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge mem- ory

    Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A Ross, and Alireza Fathi. Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge mem- ory. arXiv preprint arXiv:2212.05221, 2022

  23. [23]

    Language Is Not All You Need: Aligning Perception with Language Models

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023

  24. [24]

    Learning by ab- straction: The neural state machine

    Drew Hudson and Christopher D Manning. Learning by ab- straction: The neural state machine. Advances in Neural In- formation Processing Systems, 32, 2019

  25. [25]

    Hudson and Christopher D

    Drew A. Hudson and Christopher D. Manning. Composi- tional Attention Networks for Machine Reasoning. ArXiv, 2018

  26. [26]

    GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

    Drew A. Hudson and Christopher D. Manning. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, May 2019. arXiv:1902.09506 [cs]

  27. [27]

    Lawrence Zitnick, and Ross Girshick

    Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Inferring and Executing Programs for Visual Rea- soning. pages 2989–2998, 2017

  28. [28]

    Thinking, fast and slow

    Daniel Kahneman. Thinking, fast and slow . macmillan, 2011

  29. [29]

    Vi- sual reasoning by progressive module networks

    Seung Wook Kim, Makarand Tapaswi, and Sanja Fidler. Vi- sual reasoning by progressive module networks. In Interna- tional Conference on Learning Representations, 2019

  30. [30]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, Jan

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, Jan

  31. [31]

    arXiv:2301.12597 [cs]

  32. [32]

    Grounded language-image pre-training

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jian- wei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022

  33. [33]

    Retrieval augmented visual ques- tion answering with outside knowledge

    Weizhe Lin and Bill Byrne. Retrieval augmented visual ques- tion answering with outside knowledge. In Proceedings of the 2022 Conference on Empirical Methods in Natural Lan- guage Processing, pages 11238–11254, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics

  34. [34]

    REVIVE: Regional visual rep- resentation matters in knowledge-based visual question an- swering

    Yuanze Lin, Yujia Xie, Dongdong Chen, Yichong Xu, Chen- guang Zhu, and Lu Yuan. REVIVE: Regional visual rep- resentation matters in knowledge-based visual question an- swering. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Informa- tion Processing Systems, 2022

  35. [35]

    Language models of code are few-shot commonsense learners

    Aman Madaan, Shuyan Zhou, Uri Alon, Yiming Yang, and Graham Neubig. Language models of code are few-shot commonsense learners. arXiv preprint arXiv:2210.07128 , 2022

  36. [36]

    Dou- bly Right Object Recognition: A Why Prompt for Visual Ra- tionales, Dec

    Chengzhi Mao, Revant Teotia, Amrutha Sundar, Sachit Menon, Junfeng Yang, Xin Wang, and Carl V ondrick. Dou- bly Right Object Recognition: A Why Prompt for Visual Ra- tionales, Dec. 2022. arXiv:2212.06202 [cs]

  37. [37]

    OK-VQA: A Visual Question Answer- ing Benchmark Requiring External Knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A Visual Question Answer- ing Benchmark Requiring External Knowledge. May 2019

  38. [38]

    Visual Classification via Description from Large Language Models, Dec

    Sachit Menon and Carl V ondrick. Visual Classification via Description from Large Language Models, Dec. 2022. arXiv:2210.07183 [cs]

  39. [39]

    Simple open-vocabulary object detection with vision transformers

    Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. Simple open-vocabulary object detection with vi- sion transformers. arXiv preprint arXiv:2205.06230, 2022

  40. [40]

    Coarse-to-fine reason- ing for visual question answering

    Binh X Nguyen, Tuong Do, Huy Tran, Erman Tjiputra, Quang D Tran, and Anh Nguyen. Coarse-to-fine reason- ing for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4558–4566, 2022

  41. [41]

    Talm: Tool augmente d language models

    Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool aug- mented language models. arXiv preprint arXiv:2205.12255, 2022

  42. [42]

    Multimodal Explanations: Justifying Decisions and Pointing to the Evidence

    Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. Multimodal Explanations: Justifying Decisions and Pointing to the Evidence. In2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 8779– 8788, Salt Lake City, UT, June 2018. IEEE

  43. [43]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Rai- son, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-per...

  44. [44]

    Learn- ing transferable visual models from natural language super- vision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In International conference on machine learning , pages 8748–8763. PMLR, 2021

  45. [45]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 44(3), 2022

  46. [46]

    Mumuqa: Multimedia multi-hop news question answering via cross-media knowl- edge extraction and grounding

    Revant Gangi Reddy, Xilin Rui, Manling Li, Xudong Lin, Haoyang Wen, Jaemin Cho, Lifu Huang, Mohit Bansal, Avirup Sil, Shih-Fu Chang, et al. Mumuqa: Multimedia multi-hop news question answering via cross-media knowl- edge extraction and grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11200–11208, 2022

  47. [47]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Can- cedda, and Thomas Scialom. Toolformer: Language mod- els can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

  48. [48]

    Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra

    Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Ba- tra. Grad-CAM: Visual Explanations from Deep Net- works via Gradient-based Localization. International Jour- nal of Computer Vision, 128(2):336–359, Feb. 2020. arXiv: 1610.02391

  49. [49]

    Obtaining Faithful Interpretations from Compositional Neural Networks

    Sanjay Subramanian, Ben Bogin, Nitish Gupta, Tomer Wolfson, Sameer Singh, Jonathan Berant, and Matt Gardner. Obtaining Faithful Interpretations from Compositional Neural Networks, Sept. 2020. arXiv:2005.00724 [cs]

  50. [50]

    Reclip: A strong zero-shot baseline for referring expression comprehension

    Sanjay Subramanian, Will Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, and Anna Rohrbach. Reclip: A strong zero-shot baseline for referring expression comprehension. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, May 2022. Association for Computational Linguistics

  51. [51]

    Learning to learn words from visual scenes

    Dídac Surís, Dave Epstein, Heng Ji, Shih-Fu Chang, and Carl Vondrick. Learning to learn words from visual scenes. European Conference on Computer Vision (ECCV), 2020

  52. [52]

    Lxmert: Learning cross-modality encoder representations from transformers

    Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019

  53. [53]

    Anthony Meng Huat Tiong, Junnan Li, Boyang Li, Silvio Savarese, and Steven C.H. Hoi. Plug-and-play VQA: Zero-shot VQA by conjoining large pretrained models with zero training. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 951–967, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics

  54. [54]

    Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework

    Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. CoRR, abs/2202.03052, 2022

  55. [55]

    Code4struct: Code generation for few-shot structured prediction from natural language

    Xingyao Wang, Sha Li, and Heng Ji. Code4struct: Code generation for few-shot structured prediction from natural language. arXiv preprint arXiv:2210.12810, 2022

  56. [56]

    Language models with image descriptors are strong few-shot video-language learners

    Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, et al. Language models with image descriptors are strong few-shot video-language learners. 2022

  57. [57]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain of Thought Prompting Elicits Reasoning in Large Language Models, Oct. 2022. arXiv:2201.11903 [cs]

  58. [58]

    Separating skills and concepts for novel visual question answering

    Spencer Whitehead, Hui Wu, Heng Ji, Rogerio Feris, and Kate Saenko. Separating skills and concepts for novel visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5632–5641, June 2021

  59. [59]

    Video graph transformer for video question answering

    Junbin Xiao, Pan Zhou, Tat-Seng Chua, and Shuicheng Yan. Video graph transformer for video question answering. In European Conference on Computer Vision, pages 39–58. Springer, 2022

  60. [60]

    Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

    Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Apr. 2016. arXiv:1502.03044 [cs]

  61. [61]

    An empirical study of gpt-3 for few-shot knowledge-based vqa

    Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3081–3089, 2022

  62. [62]

    Hitea: Hierarchical temporal-aware video-language pre-training

    Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, and Fei Huang. Hitea: Hierarchical temporal-aware video-language pre-training. arXiv preprint arXiv:2212.14546, 2022

  63. [63]

    Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding

    Kexin Yi, Jiajun Wu, Chuang Gan, A. Torralba, Pushmeet Kohli, and J. Tenenbaum. Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding. ArXiv, 2018

  64. [64]

    Socratic models: Composing zero-shot multimodal reasoning with language

    Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv, 2022

  65. [65]

    Multi-grained vision language pre-training: Aligning texts with visual concepts

    Yan Zeng, Xinsong Zhang, and Hang Li. Multi-grained vision language pre-training: Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276, 2021

  66. [66]

    Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining

    Yundong Zhang, Juan Carlos Niebles, and Alvaro Soto. Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining, Aug. 2018. arXiv:1808.00265 [cs].

A. Pretrained Models

We specify details about all the pretrained models used, as well as the code-generation large language model:

• GLIP [31]. We use the implementation f...