ViperGPT: Visual Inference via Python Execution for Reasoning

Carl Vondrick; Dídac Surís; Sachit Menon

arxiv: 2303.08128 · v1 · pith:EJIVFRXUnew · submitted 2023-03-14 · 💻 cs.CV


Pith reviewed 2026-05-17 18:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords: visual reasoning · code generation · vision language models · programmatic composition · visual question answering · modular reasoning · python execution · zero-shot inference


The pith

ViperGPT uses code generation to create Python programs that combine vision models for answering complex visual queries without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual queries require both seeing and reasoning, but most end-to-end models blend the two in ways that limit interpretability and generalization. ViperGPT instead has a language model write Python code that calls separate vision tools as needed, then runs that code to produce the answer. The approach requires no extra training on the target tasks, and the paper reports state-of-the-art results on several challenging visual reasoning benchmarks. Because the programs are explicit, each reasoning step can be inspected, and the system can handle new combinations of queries.

Core claim

The central discovery is that composing vision-and-language models via generated Python code executed at inference time solves complex visual tasks at state-of-the-art levels without any task-specific training.

What carries the argument

A code-generation model that writes and executes Python programs using a fixed API to available vision and language modules.
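The generate-then-execute loop described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: `Patch` stands in for a vision module, `fake_code_model` stands in for the code-generation model and returns a fixed program, and `execute_command` is a hypothetical entry-point name.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Patch:
    """Hypothetical stand-in for an image region with detected object labels."""
    labels: List[str]

    def find(self, name: str) -> List[str]:
        # A real system would call an object detector here.
        return [label for label in self.labels if label == name]


def fake_code_model(query: str) -> str:
    """Stand-in for a code-generation model; always returns the same program."""
    return (
        "def execute_command(image):\n"
        "    muffins = image.find('muffin')\n"
        "    kids = image.find('kid')\n"
        "    return len(muffins) // len(kids)\n"
    )


def answer(query: str, image: Patch,
           generate: Callable[[str], str] = fake_code_model):
    """Generate a program for the query, then execute it on the image."""
    program = generate(query)
    namespace = {}
    exec(program, namespace)  # defines execute_command in `namespace`
    return namespace["execute_command"](image)


image = Patch(labels=["muffin"] * 8 + ["kid"] * 2)
result = answer("How many muffins can each kid have?", image)
print(result)
```

Because the intermediate program is ordinary Python, it can be read and debugged like any other code. Running model-generated code through `exec` is unsafe outside a sandbox, so a production system would need isolation that this sketch omits.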

Load-bearing premise

A language model can consistently generate correct Python code that properly uses the vision modules for any given query.

What would settle it

A benchmark run in which a large share of the generated programs contain syntax errors or logical mistakes that lead to wrong answers would undercut this premise.

Original abstract

Answering visual queries is a complex task that requires both visual processing and reasoning. End-to-end models, the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and generalization. Learning modular programs presents a promising alternative, but has proven challenging due to the difficulty of learning both the programs and modules simultaneously. We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines to produce a result for any query. ViperGPT utilizes a provided API to access the available modules, and composes them by generating Python code that is later executed. This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ViperGPT, a framework that leverages code-generation models (e.g., Codex or GPT) to synthesize Python programs composing pre-trained vision-and-language modules via a provided API, enabling visual reasoning on complex queries without any additional training or fine-tuning. It claims this yields state-of-the-art results across visual tasks such as VQA.

Significance. If the empirical results prove robust, the work demonstrates that LLM-driven program synthesis can produce interpretable, modular visual reasoning systems that avoid end-to-end training, offering potential gains in generalization, error tracing, and reuse of existing vision modules.

major comments (2)

[Experiments] Experiments section: The manuscript reports state-of-the-art results on datasets including GQA and OK-VQA yet provides no aggregate statistics on code-generation success rate, retry frequency, or error types (logic errors, API misuse, execution failures) across the full test sets. This is load-bearing for the central claim that the 'simple approach' reliably achieves SOTA without training, because performance may reflect only the subset of queries where the LLM produces correct executable code.
[Section 3] Section 3: The few-shot prompting procedure for generating code is described, but the text does not quantify or bound the reliability of the generated programs for arbitrary queries, nor does it detail how failed generations are filtered or retried before reporting final accuracy numbers.
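The statistics requested in the major comments could be gathered with a small harness like the one sketched below. This is an illustrative assumption, not the paper's evaluation code: the category names, the `execute_command` entry point, and the rule that maps `AttributeError` or `NameError` to API misuse are all hypothetical choices. A wrong answer is labeled `logic` only as a proxy, since it could equally come from a vision module error.

```python
from collections import Counter


def classify(program: str, image, expected) -> str:
    """Assign one generated program to a coarse outcome category."""
    try:
        compiled = compile(program, "<generated>", "exec")
    except SyntaxError:
        return "syntax"
    namespace = {}
    try:
        exec(compiled, namespace)
        result = namespace["execute_command"](image)
    except (AttributeError, NameError):
        # Heuristic: calling something the API does not provide.
        return "api_misuse"
    except Exception:
        return "execution"
    # Proxy only: a wrong answer may also stem from a faulty module.
    return "ok" if result == expected else "logic"


samples = [
    ("def execute_command(image):\n    return len(image)\n", [1, 2], 2),
    ("def execute_command(image)\n    return 1\n", [1, 2], 1),
    ("def execute_command(image):\n    return image.find('cat')\n", [1, 2], 0),
    ("def execute_command(image):\n    return 1 // 0\n", [1, 2], 0),
    ("def execute_command(image):\n    return len(image) + 1\n", [1, 2], 2),
]

tally = Counter(classify(program, image, expected)
                for program, image, expected in samples)
success_rate = tally["ok"] / len(samples)
print(dict(tally), success_rate)
```

Reporting the full tally alongside accuracy would show whether headline numbers cover every query or only those with executable programs.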

minor comments (2)

[Abstract] The abstract asserts 'state-of-the-art results across various complex visual tasks' without naming the specific datasets or reporting the magnitude of improvement over baselines.
[Section 3] Notation for the vision modules and API calls could be made more consistent between the method description and the example programs shown in figures.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Point-by-point responses

Referee: [Experiments] Experiments section: The manuscript reports state-of-the-art results on datasets including GQA and OK-VQA yet provides no aggregate statistics on code-generation success rate, retry frequency, or error types (logic errors, API misuse, execution failures) across the full test sets. This is load-bearing for the central claim that the 'simple approach' reliably achieves SOTA without training, because performance may reflect only the subset of queries where the LLM produces correct executable code.

Authors: We agree that aggregate statistics on code-generation success would strengthen the central claim. In the revised manuscript we will add a new analysis subsection to the Experiments section reporting overall code-generation success rate, retry counts, and error-type breakdown (syntax, API misuse, execution, logic) across the full GQA and OK-VQA test sets. These numbers will show that the reported accuracies reflect the complete test distributions after transparent retry handling rather than a cherry-picked subset.
revision: yes

Referee: [Section 3] Section 3: The few-shot prompting procedure for generating code is described, but the text does not quantify or bound the reliability of the generated programs for arbitrary queries, nor does it detail how failed generations are filtered or retried before reporting final accuracy numbers.

Authors: We will expand Section 3 with empirical quantification of code-generation reliability drawn from our validation experiments (success rates on held-out queries) and a clear description of the retry and filtering procedure (re-prompting with error feedback or fallback to a default program). We note, however, that a general theoretical bound on reliability for arbitrary queries lies outside the scope of this empirical study and would require assumptions about the underlying LLM that we do not claim.
revision: partial
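The retry procedure mentioned in the simulated rebuttal (re-prompting with error feedback, then falling back to a default) might look roughly like the sketch below. The function names, the `feedback` argument, and the attempt limit are assumptions introduced for illustration; the source does not specify how retries are implemented.

```python
def run_with_retries(query, image, generate, max_attempts=3, fallback=None):
    """Try generated programs until one runs; return (answer, attempts used)."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        program = generate(query, feedback)
        try:
            namespace = {}
            exec(program, namespace)
            return namespace["execute_command"](image), attempt
        except Exception as err:
            # Pass the error text back so the next generation can correct it.
            feedback = f"{type(err).__name__}: {err}"
    return fallback, max_attempts


def flaky_generate(query, feedback):
    """Hypothetical generator: fails first, succeeds once it sees feedback."""
    if not feedback:
        return "def execute_command(image):\n    return image.missing_attribute\n"
    return "def execute_command(image):\n    return len(image)\n"


answer, attempts = run_with_retries("How many items?", [1, 2, 3], flaky_generate)
print(answer, attempts)
```

Returning the attempt count with the answer makes retry frequency easy to log, which is the quantity the referee asks to see reported.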

standing simulated objections not resolved

A theoretical bound on reliability for arbitrary queries

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on benchmark results

full rationale

The paper introduces ViperGPT as a training-free framework that uses an off-the-shelf code-generation model to compose provided vision modules via Python execution. Its central claims (no further training required, SOTA on visual reasoning tasks) are supported by experimental evaluation on standard benchmarks rather than by any derivation, equation, or first-principles prediction. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described method; the approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework assumes the existence of a reliable code-generation model and a fixed set of vision modules exposed through an API. No free parameters are fitted inside the method itself; the only external dependencies are the pre-trained modules and the code model.

axioms (2)

domain assumption: A code-generation model can produce Python programs that correctly invoke the provided vision modules for the target queries.
Invoked implicitly when the paper states that the generated code is executed to produce results.
domain assumption: The supplied API exposes a sufficient set of vision-and-language modules to solve the evaluated tasks.
Required for the composition approach to be viable; stated via the phrase 'utilizes a provided API to access the available modules'.
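Both assumptions hinge on what the code model is shown of the API. As a rough sketch of one way an API surface could be rendered into a prompt, the block below lists method signatures and docstrings ahead of the query. The `VisionAPI` class, its method names, and the prompt layout are hypothetical and are not the interface used in the paper.

```python
import inspect


class VisionAPI:
    """Hypothetical API surface shown to the code-generation model."""

    def find(self, object_name: str) -> list:
        """Return image regions that contain `object_name`."""
        raise NotImplementedError

    def simple_query(self, question: str) -> str:
        """Answer a basic question about the image."""
        raise NotImplementedError


def build_prompt(api_cls, query: str) -> str:
    """Render public method signatures and docstrings, then append the query."""
    lines = [f"class {api_cls.__name__}:"]
    for name, fn in inspect.getmembers(api_cls, inspect.isfunction):
        if name.startswith("_"):
            continue  # skip private and dunder methods
        lines.append(f"    def {name}{inspect.signature(fn)}:")
        lines.append(f'        """{inspect.getdoc(fn)}"""')
    lines.append("")
    lines.append(f"# Query: {query}")
    lines.append("def execute_command(image):")
    return "\n".join(lines)


prompt = build_prompt(VisionAPI, "Is there a cat on the sofa?")
print(prompt)
```

Only signatures and docstrings are exposed here, so the model composes modules without seeing their implementations. Whether that surface is sufficient for a given task is exactly what the second assumption leaves open.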



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear


We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines... This simple approach requires no further training

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications
cs.DC 2026-05 unverdicted novelty 7.0

PopPy combines an ahead-of-time compiler and runtime to extract parallelism from Python compound AI applications, delivering up to 6.4x end-to-end speedups while preserving sequential semantics.
GAIA: a benchmark for General AI Assistants
cs.CL 2023-11 unverdicted novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
cs.RO 2023-07 unverdicted novelty 7.0

VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
Visual Instruction Tuning
cs.CV 2023-04 unverdicted novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
cs.CV 2023-03 conditional novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
cs.CV 2026-05 unverdicted novelty 6.0

HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
Time Series Augmented Generation for Financial Applications
cs.AI 2026-04 unverdicted novelty 6.0

TSAG lets LLMs use external tools for financial time series analysis, with a new benchmark showing capable agents achieve near-perfect tool accuracy and minimal hallucination.
A Domain-Specific Language for LLM-Driven Trigger Generation in Multimodal Data Collection
cs.DB 2026-03 unverdicted novelty 6.0

A DSL combined with LLMs generates consistent, low-latency triggers for selective multimodal sensor data collection, outperforming direct code generation in consistency and speed with comparable detection performance.
Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models
cs.CV 2025-12 unverdicted novelty 6.0

Visual Funnel resolves contextual blindness in MLLMs by constructing an entropy-scaled portfolio of hierarchically structured image crops that preserves both local detail and global context.
PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models
cs.CV 2025-12 conditional novelty 6.0

A new dataset and fine-tuned VLM detector/explainer called PhyDetEx shows that current T2V models still struggle to generate videos that obey physical laws, with open-source models performing worse.
Grounded Reinforcement Learning for Visual Reasoning
cs.CV 2025-05 unverdicted novelty 6.0

ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.
What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction
cs.CV 2024-07 unverdicted novelty 6.0

Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
cs.CV 2023-11 unverdicted novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
A Survey on Large Language Model based Autonomous Agents
cs.AI 2023-08 accept novelty 6.0

A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
cs.CV 2023-04 conditional novelty 6.0

MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
cs.CV 2023-03 unverdicted novelty 6.0

MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
MORN: Metacognitive Object-Goal Regulation for Resource-Rational Long-Horizon Navigation
cs.RO 2026-05 unverdicted novelty 5.0

MORN augments frozen VLM-based object navigation agents with a System 2 meta-controller using Potentiality Index, Persistence Gating, and Evidence Accumulation to improve goal completion rate from 0.23 to 0.30 and red...
MIRAGE: A Micro-Interaction Relational Architecture for Grounded Exploration in Multi-Figure Artworks
cs.CV 2026-04 unverdicted novelty 5.0

MIRAGE improves VLM analysis of multi-figure art by inserting a verifiable structured representation of micro-interactions between spatial grounding and narrative output.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
cs.CV 2023-12 unverdicted novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
cs.CV 2023-04 conditional novelty 5.0

LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.
Chat Modeling: Interaction-Enhanced Agent Framework for Visualizing Literature-Grounded Biological Structures
cs.HC 2024-04 unverdicted novelty 4.0

Chat Modeling is a multi-agent LLM framework with modeling memory and dynamic chat widgets that translates text inputs into interactive 3D modeling operations for literature-grounded biological structures.
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
cs.CV 2023-09 conditional novelty 4.0

GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.
A Comprehensive Overview of Large Language Models
cs.CL 2023-07 unverdicted novelty 2.0

A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 23 Pith papers · 12 internal anchors

[1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Men- sch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Se- bastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sa- hand Sharifzadeh, Mikolaj ...

work page 2022
[2]

Neural module networks

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), June 2016

work page 2016
[3]

Systematic Generalization: What Is Required and Can It Be Learned?

Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch, Thien Huu Nguyen, Harm de Vries, and Aaron Courville. Systematic Generalization: What Is Required and Can It Be Learned?, Apr. 2019. arXiv:1811.12889 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2019
[4]

arXiv:1709.08568 [cs.LG].https: //arxiv.org/abs/1709.08568

Yoshua Bengio. The Consciousness Prior, Dec. 2019. arXiv:1709.08568 [cs, stat]

work page arXiv 2019
[5]

Bravo, Sudhanshu Mittal, Simon Ging, and Thomas Brox

Maria A. Bravo, Sudhanshu Mittal, Simon Ging, and Thomas Brox. Open-vocabulary attribute detection. arXiv preprint arXiv:2211.12914, 2022

work page arXiv 2022
[6]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, Sand- hini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz...

work page internal anchor Pith review Pith/arXiv arXiv 2005
[7]

Revisiting the" video" in video-language understanding

Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, and Juan Carlos Niebles. Revisiting the" video" in video-language understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2917–2927, 2022

work page 2022
[8]

In- terpretable Visual Question Answering by Reasoning on De- pendency Trees

Qingxing Cao, Xiaodan Liang, Bailin Li, and Liang Lin. In- terpretable Visual Question Answering by Reasoning on De- pendency Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(3):887–901, Mar. 2021

work page 2021
[9]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Hen- rique Ponde, Jared Kaplan, Harrison Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ry- der, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Moham- mad Bavarian, Clemens...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling com- putation from reasoning for numerical reasoning tasks.arXiv preprint arXiv:2211.12588, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

Visual Grounding via Accumulated At- tention

Chaorui Deng, Qi Wu, Qingyao Wu, Fuyuan Hu, Fan Lyu, and Mingkui Tan. Visual Grounding via Accumulated At- tention

work page
[12]

Dijkstra

E.W. Dijkstra. Information streams sharing a ﬁnite buffer. Information Processing Letters, 1(5):179–180, 1972

work page 1972
[13]

Transform-retrieve- generate: Natural language-centric outside-knowledge vi- sual question answering

Feng Gao, Qing Ping, Govind Thattai, Aishwarya Reganti, Ying Nian Wu, and Prem Natarajan. Transform-retrieve- generate: Natural language-centric outside-knowledge vi- sual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 5067–5077, 2022

work page 2022
[14]

PAL: Program-aided Language Models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[15]

Cofar: Commonsense and factual reasoning in image search

Prajwal Gatti, Abhirama Subramanyam Penamakuri, Revant Teotia, Anand Mishra, Shubhashis Sengupta, and Roshni Ramnani. Cofar: Commonsense and factual reasoning in image search. In Proceedings of the 2nd Conference of the Asia-Paciﬁc Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Proc...

work page 2022
[16]

KAT: A knowledge augmented transformer for vision-and-language

Liangke Gui, Borui Wang, Qiuyuan Huang, Alexander Hauptmann, Yonatan Bisk, and Jianfeng Gao. KAT: A knowledge augmented transformer for vision-and-language. In Proceedings of the 2022 Conference of the North Ameri- can Chapter of the Association for Computational Linguis- tics: Human Language Technologies, pages 956–968, Seat- tle, United States, July 202...

work page 2022
[17]

Visual programming: Compositional visual reason- ing without training

Tanmay Gupta and Aniruddha Kembhavi. Visual pro- gramming: Compositional visual reasoning without training. arXiv preprint arXiv:2211.11559, 2022

work page arXiv 2022
[18]

In- terpretable visual reasoning: A survey

Feijuan He, Yaxian Wang, Xianglin Miao, and Xia Sun. In- terpretable visual reasoning: A survey. Image and Vision Computing, 112:104194, 2021

work page 2021
[19]

Learning to Reason: End-to- End Module Networks for Visual Question Answering.2017 IEEE International Conference on Computer Vision (ICCV), pages 804–813, Oct

Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to Reason: End-to- End Module Networks for Visual Question Answering.2017 IEEE International Conference on Computer Vision (ICCV), pages 804–813, Oct. 2017. Conference Name: 2017 IEEE International Conference on Computer Vision (ICCV) ISBN: 9781538610329 Place: Venice P...

work page 2017
[20]

Language-conditioned graph networks for relational reasoning

Ronghang Hu, Anna Rohrbach, Trevor Darrell, and Kate Saenko. Language-conditioned graph networks for relational reasoning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10294–10303, 2019

work page 2019
[21]

A., and Luo, J

Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A Smith, and Jiebo Luo. Promptcap: Prompt-guided task- aware image captioning. arXiv preprint arXiv:2211.09699, 2022

work page arXiv 2022
[22]

Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge mem- ory

Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A Ross, and Alireza Fathi. Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge mem- ory. arXiv preprint arXiv:2212.05221, 2022

work page arXiv 2022
[23]

Language Is Not All You Need: Aligning Perception with Language Models

Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Learning by ab- straction: The neural state machine

Drew Hudson and Christopher D Manning. Learning by ab- straction: The neural state machine. Advances in Neural In- formation Processing Systems, 32, 2019

work page 2019
[25]

Hudson and Christopher D

Drew A. Hudson and Christopher D. Manning. Composi- tional Attention Networks for Machine Reasoning. ArXiv, 2018

work page 2018
[26]

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

Drew A. Hudson and Christopher D. Manning. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, May 2019. arXiv:1902.09506 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2019
[27]

Lawrence Zitnick, and Ross Girshick

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Inferring and Executing Programs for Visual Rea- soning. pages 2989–2998, 2017

work page 2017
[28]

Thinking, fast and slow

Daniel Kahneman. Thinking, fast and slow . macmillan, 2011

work page 2011
[29]

Vi- sual reasoning by progressive module networks

Seung Wook Kim, Makarand Tapaswi, and Sanja Fidler. Vi- sual reasoning by progressive module networks. In Interna- tional Conference on Learning Representations, 2019

work page 2019
[30]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, Jan

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, Jan

work page
[31]

arXiv:2301.12597 [cs]

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Grounded language-image pre-training

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jian- wei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022

work page 2022
[33]

Retrieval augmented visual ques- tion answering with outside knowledge

Weizhe Lin and Bill Byrne. Retrieval augmented visual ques- tion answering with outside knowledge. In Proceedings of the 2022 Conference on Empirical Methods in Natural Lan- guage Processing, pages 11238–11254, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics

work page 2022
[34]

REVIVE: Regional visual rep- resentation matters in knowledge-based visual question an- swering

Yuanze Lin, Yujia Xie, Dongdong Chen, Yichong Xu, Chen- guang Zhu, and Lu Yuan. REVIVE: Regional visual rep- resentation matters in knowledge-based visual question an- swering. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Informa- tion Processing Systems, 2022

work page 2022
[35]

Language models of code are few-shot commonsense learners

Aman Madaan, Shuyan Zhou, Uri Alon, Yiming Yang, and Graham Neubig. Language models of code are few-shot commonsense learners. arXiv preprint arXiv:2210.07128 , 2022

work page arXiv 2022
[36]

Dou- bly Right Object Recognition: A Why Prompt for Visual Ra- tionales, Dec

Chengzhi Mao, Revant Teotia, Amrutha Sundar, Sachit Menon, Junfeng Yang, Xin Wang, and Carl V ondrick. Dou- bly Right Object Recognition: A Why Prompt for Visual Ra- tionales, Dec. 2022. arXiv:2212.06202 [cs]

work page arXiv 2022
[37]

OK-VQA: A Visual Question Answer- ing Benchmark Requiring External Knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A Visual Question Answer- ing Benchmark Requiring External Knowledge. May 2019

work page 2019
[38]

Visual Classiﬁcation via Description from Large Language Models, Dec

Sachit Menon and Carl V ondrick. Visual Classiﬁcation via Description from Large Language Models, Dec. 2022. arXiv:2210.07183 [cs]

work page arXiv 2022
[39]

Simple open-vocabulary object detection with vision transformers

Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. Simple open-vocabulary object detection with vi- sion transformers. arXiv preprint arXiv:2205.06230, 2022

work page arXiv 2022
[40]

Coarse-to-ﬁne reason- ing for visual question answering

Binh X Nguyen, Tuong Do, Huy Tran, Erman Tjiputra, Quang D Tran, and Anh Nguyen. Coarse-to-ﬁne reason- ing for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4558–4566, 2022

work page 2022
[41]

Talm: Tool augmente d language models

Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool aug- mented language models. arXiv preprint arXiv:2205.12255, 2022

work page arXiv 2022
[42]

Multimodal Explanations: Justifying Decisions and Pointing to the Evidence

Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. Multimodal Explanations: Justifying Decisions and Pointing to the Evidence. In2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 8779– 8788, Salt Lake City, UT, June 2018. IEEE

work page 2018
[43]

Pytorch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Rai- son, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-per...

work page 2019
[44]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In International conference on machine learning , pages 8748–8763. PMLR, 2021

work page 2021
[45]

Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 44(3), 2022

work page 2022
[46]

Mumuqa: Multimedia multi-hop news question answering via cross-media knowl- edge extraction and grounding

Revant Gangi Reddy, Xilin Rui, Manling Li, Xudong Lin, Haoyang Wen, Jaemin Cho, Lifu Huang, Mohit Bansal, Avirup Sil, Shih-Fu Chang, et al. Mumuqa: Multimedia multi-hop news question answering via cross-media knowl- edge extraction and grounding. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 36, pages 11200–11208, 2022

work page 2022
[47]

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Can- cedda, and Thomas Scialom. Toolformer: Language mod- els can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

[48]

Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. International Journal of Computer Vision, 128(2):336–359, Feb. 2020. arXiv:1610.02391

[49]

Obtaining Faithful Interpretations from Compositional Neural Networks

Sanjay Subramanian, Ben Bogin, Nitish Gupta, Tomer Wolfson, Sameer Singh, Jonathan Berant, and Matt Gardner. Obtaining Faithful Interpretations from Compositional Neural Networks, Sept. 2020. arXiv:2005.00724 [cs]

[50]

Reclip: A strong zero-shot baseline for referring expression comprehension

Sanjay Subramanian, Will Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, and Anna Rohrbach. Reclip: A strong zero-shot baseline for referring expression comprehension. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, May 2022. Association for Computational Linguistics

[51]

Learning to learn words from visual scenes

Dídac Surís, Dave Epstein, Heng Ji, Shih-Fu Chang, and Carl Vondrick. Learning to learn words from visual scenes. European Conference on Computer Vision (ECCV), 2020

[52]

Lxmert: Learning cross-modality encoder representations from transformers

Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019

[53]

Plug-and-play VQA: Zero-shot VQA by conjoining large pretrained models with zero training

Anthony Meng Huat Tiong, Junnan Li, Boyang Li, Silvio Savarese, and Steven C.H. Hoi. Plug-and-play VQA: Zero-shot VQA by conjoining large pretrained models with zero training. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 951–967, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics

[54]

Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. CoRR, abs/2202.03052, 2022

[55]

Code4struct: Code generation for few-shot structured prediction from natural language

Xingyao Wang, Sha Li, and Heng Ji. Code4struct: Code generation for few-shot structured prediction from natural language. arXiv preprint arXiv:2210.12810, 2022

[56]

Language models with image descriptors are strong few-shot video-language learners

Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, et al. Language models with image descriptors are strong few-shot video-language learners. 2022

[57]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain of Thought Prompting Elicits Reasoning in Large Language Models, Oct. 2022. arXiv:2201.11903 [cs]

[58]

Separating skills and concepts for novel visual question answering

Spencer Whitehead, Hui Wu, Heng Ji, Rogerio Feris, and Kate Saenko. Separating skills and concepts for novel visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5632–5641, June 2021

[59]

Video graph transformer for video question answering

Junbin Xiao, Pan Zhou, Tat-Seng Chua, and Shuicheng Yan. Video graph transformer for video question answering. In European Conference on Computer Vision, pages 39–58. Springer, 2022

[60]

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Apr. 2016. arXiv:1502.03044 [cs]

[61]

An empirical study of gpt-3 for few-shot knowledge-based vqa

Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3081–3089, 2022

[62]

Hitea: Hierarchical temporal-aware video-language pre-training

Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, and Fei Huang. Hitea: Hierarchical temporal-aware video-language pre-training. arXiv preprint arXiv:2212.14546, 2022. URL https://doi.org/10.48550/arXiv.2212.14546

[63]

Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding

Kexin Yi, Jiajun Wu, Chuang Gan, A. Torralba, Pushmeet Kohli, and J. Tenenbaum. Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding. ArXiv, 2018

[64]

Socratic models: Composing zero-shot multimodal reasoning with language

Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv, 2022

[65]

Multi-grained vision language pre-training: Aligning texts with visual concepts

Yan Zeng, Xinsong Zhang, and Hang Li. Multi-grained vision language pre-training: Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276, 2021

[66]

Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining

Yundong Zhang, Juan Carlos Niebles, and Alvaro Soto. Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining, Aug. 2018. arXiv:1808.00265 [cs]

A. Pretrained Models

We specify details about all the pretrained models used, as well as the code-generation large language model:

• GLIP [31]. We use the implementation f...

