ViperGPT: Visual Inference via Python Execution for Reasoning
Pith reviewed 2026-05-17 18:09 UTC · model grok-4.3
The pith
ViperGPT uses code generation to create Python programs that combine vision models for answering complex visual queries without training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that composing vision-and-language models via generated Python code executed at inference time solves complex visual tasks at state-of-the-art levels without any task-specific training.
What carries the argument
A code-generation model that writes and executes Python programs using a fixed API to available vision and language modules.
Load-bearing premise
A language model can consistently generate correct Python code that properly uses the vision modules for any given query.
What would settle it
Running the system on a benchmark where many generated programs contain syntax errors or logical mistakes that lead to wrong answers.
Original abstract
Answering visual queries is a complex task that requires both visual processing and reasoning. End-to-end models, the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and generalization. Learning modular programs presents a promising alternative, but has proven challenging due to the difficulty of learning both the programs and modules simultaneously. We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines to produce a result for any query. ViperGPT utilizes a provided API to access the available modules, and composes them by generating Python code that is later executed. This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.
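The compose-by-code pattern described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `ToyImage`, its `find` method, the `answer_query` entry point, and the hard-coded `GENERATED_PROGRAM` string are all hypothetical stand-ins. A real system would obtain the program text from a code-generation model and back the API with pretrained vision models.

```python
from dataclasses import dataclass, field


@dataclass
class ToyImage:
    """Hypothetical stand-in for an image; `objects` fakes detector output."""
    objects: list = field(default_factory=list)

    def find(self, name: str) -> list:
        # A real module would run an object detector here (assumption).
        return [obj for obj in self.objects if obj == name]


# Program text that a code-generation model might return for the query
# "Are there more cats than dogs?". Hard-coded here for illustration.
GENERATED_PROGRAM = '''
def answer_query(image):
    cats = image.find("cat")
    dogs = image.find("dog")
    return "yes" if len(cats) > len(dogs) else "no"
'''


def run_generated(program: str, image: ToyImage) -> str:
    """Execute generated code in a fresh namespace and call its entry point."""
    namespace = {"ToyImage": ToyImage}
    exec(program, namespace)  # defines answer_query inside `namespace`
    return namespace["answer_query"](image)


image = ToyImage(objects=["cat", "dog", "cat"])
result = run_generated(GENERATED_PROGRAM, image)
print(result)
```

Note that calling `exec` on model-written code is unsafe outside a sandbox; the sketch omits isolation to keep the control flow visible.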
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ViperGPT, a framework that leverages code-generation models (e.g., Codex or GPT) to synthesize Python programs composing pre-trained vision-and-language modules via a provided API, enabling visual reasoning on complex queries without any additional training or fine-tuning. It claims this yields state-of-the-art results across visual tasks such as VQA.
Significance. If the empirical results prove robust, the work demonstrates that LLM-driven program synthesis can produce interpretable, modular visual reasoning systems that avoid end-to-end training, offering potential gains in generalization, error tracing, and reuse of existing vision modules.
major comments (2)
- [Experiments] Experiments section: The manuscript reports state-of-the-art results on datasets including GQA and OK-VQA yet provides no aggregate statistics on code-generation success rate, retry frequency, or error types (logic errors, API misuse, execution failures) across the full test sets. This is load-bearing for the central claim that the 'simple approach' reliably achieves SOTA without training, because performance may reflect only the subset of queries where the LLM produces correct executable code.
- [Section 3] Section 3: The few-shot prompting procedure for generating code is described, but the text does not quantify or bound the reliability of the generated programs for arbitrary queries, nor does it detail how failed generations are filtered or retried before reporting final accuracy numbers.
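The aggregate statistics requested in the major comments could be collected with a small harness along the following lines. This is a rough sketch, not a procedure taken from the manuscript: the `answer` entry point, the `StubImage` class, and the `ALLOWED_METHODS` set are hypothetical, and the API-misuse check is deliberately crude (it flags any method call whose name is outside the allowed set). Logic errors cannot be detected this way; they require comparing answers against ground truth.

```python
import ast
from collections import Counter

ALLOWED_METHODS = {"find", "exists"}  # hypothetical API surface


class StubImage:
    """Hypothetical image stub exposing the two allowed methods."""

    def find(self, name):
        return []

    def exists(self, name):
        return False


def run(program: str):
    """Execute program text and call its `answer` function on a stub image."""
    namespace = {}
    exec(program, namespace)
    return namespace["answer"](StubImage())


def classify(program: str) -> str:
    """Return one of 'syntax', 'api_misuse', 'execution', 'ok'."""
    try:
        tree = ast.parse(program)
    except SyntaxError:
        return "syntax"
    # Flag method calls whose name is not part of the allowed API.
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr not in ALLOWED_METHODS:
                return "api_misuse"
    try:
        run(program)
    except Exception:
        return "execution"
    return "ok"


programs = [
    "def answer(image):\n    return image.exists('cat')\n",   # runs cleanly
    "def answer(image)\n    return image.exists('cat')\n",    # missing colon
    "def answer(image):\n    return image.segment('cat')\n",  # unknown method
    "def answer(image):\n    return image.find('cat')[0]\n",  # IndexError at run time
]
counts = Counter(classify(p) for p in programs)
print(counts)
```

Reporting such counts over a full test set, alongside accuracy, would show how much of the headline number depends on programs that ran at all.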
minor comments (2)
- [Abstract] The abstract asserts 'state-of-the-art results across various complex visual tasks' without naming the specific datasets or reporting the magnitude of improvement over baselines.
- [Section 3] Notation for the vision modules and API calls could be made more consistent between the method description and the example programs shown in figures.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major comment below and indicate planned revisions to strengthen the manuscript.
-
Referee: [Experiments] Experiments section: The manuscript reports state-of-the-art results on datasets including GQA and OK-VQA yet provides no aggregate statistics on code-generation success rate, retry frequency, or error types (logic errors, API misuse, execution failures) across the full test sets. This is load-bearing for the central claim that the 'simple approach' reliably achieves SOTA without training, because performance may reflect only the subset of queries where the LLM produces correct executable code.
Authors: We agree that aggregate statistics on code-generation success would strengthen the central claim. In the revised manuscript we will add a new analysis subsection to the Experiments section reporting overall code-generation success rate, retry counts, and error-type breakdown (syntax, API misuse, execution, logic) across the full GQA and OK-VQA test sets. These numbers will show that the reported accuracies reflect the complete test distributions after transparent retry handling rather than a cherry-picked subset.
revision: yes
-
Referee: [Section 3] Section 3: The few-shot prompting procedure for generating code is described, but the text does not quantify or bound the reliability of the generated programs for arbitrary queries, nor does it detail how failed generations are filtered or retried before reporting final accuracy numbers.
Authors: We will expand Section 3 with empirical quantification of code-generation reliability drawn from our validation experiments (success rates on held-out queries) and a clear description of the retry and filtering procedure (re-prompting with error feedback or fallback to a default program). We note, however, that a general theoretical bound on reliability for arbitrary queries lies outside the scope of this empirical study and would require assumptions about the underlying LLM that we do not claim.
revision: partial
- A theoretical bound on reliability for arbitrary queries
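The retry procedure mentioned in the rebuttal (re-prompting with error feedback, then falling back to a default program) might look roughly like the sketch below. Everything here is an assumption made for illustration: the `generate` callable stands in for a code-generation model, `answer` is a hypothetical entry point, and the attempt limit of three is arbitrary. The function also returns the number of failed attempts, so retry frequency can be reported next to accuracy.

```python
def execute(program: str):
    """Run program text and call its `answer` function (hypothetical entry point)."""
    namespace = {}
    exec(program, namespace)
    return namespace["answer"]()


def answer_with_retry(query, generate, default_program, max_attempts=3):
    """Return (answer, failed_attempts); fall back to default_program if all fail."""
    feedback = ""
    for failed in range(max_attempts):
        program = generate(query, feedback)
        try:
            return execute(program), failed
        except Exception as exc:
            # Feed the error text into the next generation attempt.
            feedback = f"{type(exc).__name__}: {exc}"
    return execute(default_program), max_attempts


def fake_generate(query: str, feedback: str) -> str:
    # Stand-in for a model call: the first attempt is broken, the retry is fixed.
    if not feedback:
        return "def answer():\n    return undefined_name\n"
    return "def answer():\n    return 'two'\n"


DEFAULT_PROGRAM = "def answer():\n    return 'unknown'\n"

answer, failed = answer_with_retry("How many cats?", fake_generate, DEFAULT_PROGRAM)
print(answer, failed)
```

Logging `failed` per query makes the difference between first-attempt accuracy and accuracy after retries visible, which is the distinction the referee asks the authors to report.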
Circularity Check
No significant circularity; empirical claims rest on benchmark results
The paper introduces ViperGPT as a training-free framework that uses an off-the-shelf code-generation model to compose provided vision modules via Python execution. Its central claims (no further training required, SOTA on visual reasoning tasks) are supported by experimental evaluation on standard benchmarks rather than by any derivation, equation, or first-principles prediction. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described method; the approach is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: A code-generation model can produce correct Python programs that correctly invoke the provided vision modules for the target queries.
- Domain assumption: The supplied API exposes a sufficient set of vision-and-language modules to solve the evaluated tasks.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines... This simple approach requires no further training
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 23 Pith papers
-
PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications
PopPy combines an ahead-of-time compiler and runtime to extract parallelism from Python compound AI applications, delivering up to 6.4x end-to-end speedups while preserving sequential semantics.
-
GAIA: a benchmark for General AI Assistants
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
-
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
-
Visual Instruction Tuning
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
-
Time Series Augmented Generation for Financial Applications
TSAG lets LLMs use external tools for financial time series analysis, with a new benchmark showing capable agents achieve near-perfect tool accuracy and minimal hallucination.
-
A Domain-Specific Language for LLM-Driven Trigger Generation in Multimodal Data Collection
A DSL combined with LLMs generates consistent, low-latency triggers for selective multimodal sensor data collection, outperforming direct code generation in consistency and speed with comparable detection performance.
-
Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models
Visual Funnel resolves contextual blindness in MLLMs by constructing an entropy-scaled portfolio of hierarchically structured image crops that preserves both local detail and global context.
-
PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models
A new dataset and fine-tuned VLM detector/explainer called PhyDetEx shows that current T2V models still struggle to generate videos that obey physical laws, with open-source models performing worse.
-
Grounded Reinforcement Learning for Visual Reasoning
ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.
-
What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction
Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
A Survey on Large Language Model based Autonomous Agents
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...
-
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...
-
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
-
MORN: Metacognitive Object-Goal Regulation for Resource-Rational Long-Horizon Navigation
MORN augments frozen VLM-based object navigation agents with a System 2 meta-controller using Potentiality Index, Persistence Gating, and Evidence Accumulation to improve goal completion rate from 0.23 to 0.30 and red...
-
MIRAGE: A Micro-Interaction Relational Architecture for Grounded Exploration in Multi-Figure Artworks
MIRAGE improves VLM analysis of multi-figure art by inserting a verifiable structured representation of micro-interactions between spatial grounding and narrative output.
-
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
-
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.
-
Chat Modeling: Interaction-Enhanced Agent Framework for Visualizing Literature-Grounded Biological Structures
Chat Modeling is a multi-agent LLM framework with modeling memory and dynamic chat widgets that translates text inputs into interactive 3D modeling operations for literature-grounded biological structures.
-
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.
-
A Comprehensive Overview of Large Language Models
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.