pith. machine review for the scientific record.

arxiv: 2303.04671 · v1 · submitted 2023-03-08 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 22:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords Visual ChatGPT · multimodal interaction · prompt engineering · visual foundation models · image editing · ChatGPT · multi-step reasoning

The pith

Visual ChatGPT lets users chat with images by linking ChatGPT to visual foundation models through prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a system that adds image sending, receiving, and editing to ChatGPT conversations. It does this by writing prompts that describe the inputs, outputs, and capabilities of various visual foundation models so ChatGPT can choose and combine them for multi-step tasks. Users can also give feedback to request corrections on the visual results. A sympathetic reader would care because this turns a text-only model into one that handles complex visual instructions without retraining either ChatGPT or the visual models. The work shows how prompt design can create reliable collaboration between language and vision systems.

Core claim

By designing prompts that inject information about multiple visual foundation models into ChatGPT, the system enables users to send images, pose complex visual questions, issue multi-step editing instructions, and receive corrected results through iterative feedback, all while ChatGPT orchestrates the collaboration of the visual models.

What carries the argument

A series of prompts that describe each visual model's inputs, outputs, and feedback requirements so ChatGPT can select and sequence them for multi-step visual tasks.
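
To make the mechanism concrete, here is a minimal sketch, not the authors' code, of how such capability descriptions could be rendered into a system prompt. The tool names and fields are illustrative assumptions, not identifiers from the released repository:

```python
# Minimal sketch of prompt injection for visual tool use; all tool
# names and fields here are illustrative, not from visual-chatgpt.
from dataclasses import dataclass

@dataclass
class VisualTool:
    name: str     # e.g. "Image Captioning"
    inputs: str   # what the tool consumes
    outputs: str  # what the tool returns
    usage: str    # when the LLM should pick it

TOOLS = [
    VisualTool("Image Captioning", "an image_path",
               "a text description of the image",
               "useful when you need to know what is in an image"),
    VisualTool("Text-to-Image Generation", "a text prompt",
               "an image_path to a new image",
               "useful when asked to create an image from a description"),
    VisualTool("Instructed Image Editing", "an image_path and an edit instruction",
               "an image_path to the edited image",
               "useful when asked to change an existing image"),
]

def render_tool_prompt(tools):
    """Inject each tool's typed interface into the system prompt so the
    LLM can select and sequence tools for multi-step requests."""
    lines = ["You can use the following visual tools:"]
    for t in tools:
        lines.append(f"- {t.name}: takes {t.inputs}; returns {t.outputs}. "
                     f"It is {t.usage}.")
    lines.append("For multi-step requests, call tools one at a time and "
                 "pass each returned image_path to the next tool.")
    return "\n".join(lines)

print(render_tool_prompt(TOOLS))
```

The paper's reference list includes LangChain, which provides this kind of tool-description-to-agent wiring in practice; the sketch only shows the shape of the injected information.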

If this is right

  • Users can iterate on images by sending feedback and receiving revised outputs in the same conversation.
  • Complex visual questions become solvable by breaking them into steps that use different models for generation, detection, or editing.
  • The same prompt approach can be applied to other visual models as they are developed.
  • Conversations with ChatGPT can return images produced by the visual models instead of only text descriptions of them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could generalize to audio or video models by writing similar capability descriptions.
  • It suggests prompt engineering might serve as a lightweight way to add new modalities to existing language models.
  • Future systems might combine this with user-provided examples to further reduce selection mistakes.

Load-bearing premise

ChatGPT will correctly decompose visual requests and pick the right sequence of models from the prompt descriptions without frequent errors.

What would settle it

A fixed set of ten multi-step visual editing instructions: if the system repeatedly chooses the wrong model or fails to chain steps correctly on them, the prompt injection does not produce reliable collaboration.
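
As a hedged illustration of how that test could be run, the harness below scores exact-match tool chains over a fixed case set. `run_orchestrator` is an assumed stand-in for a Visual ChatGPT-style system, and the two cases shown are invented examples:

```python
# Hypothetical harness for the settling test above. `run_orchestrator`
# is assumed to return the ordered list of tool names the system chose.
from typing import Callable, List, Tuple

CASES: List[Tuple[str, List[str]]] = [
    ("replace the dog with a cat, then describe the result",
     ["Instructed Image Editing", "Image Captioning"]),
    ("generate a red bicycle and detect its edges",
     ["Text-to-Image Generation", "Edge Detection"]),
    # ... eight more multi-step instructions to reach the set of ten
]

def chain_accuracy(run_orchestrator: Callable[[str], List[str]]) -> float:
    """Fraction of cases whose chosen tool chain matches the expected one."""
    hits = 0
    for instruction, expected in CASES:
        chosen = run_orchestrator(instruction)
        if chosen == expected:
            hits += 1
        else:
            print(f"MISS: {instruction!r}: expected {expected}, got {chosen}")
    return hits / len(CASES)
```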

Original abstract

ChatGPT is attracting cross-field interest as it provides a language interface with remarkable conversational competency and reasoning capabilities across many domains. However, since ChatGPT is trained on language, it is currently not capable of processing or generating images from the visual world. At the same time, Visual Foundation Models, such as Visual Transformers or Stable Diffusion, show great visual understanding and generation capabilities but are only experts on specific tasks with one-round fixed inputs and outputs. To this end, we build a system called Visual ChatGPT, incorporating different Visual Foundation Models, to enable the user to interact with ChatGPT by 1) sending and receiving not only language but also images, 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models over multiple steps, and 3) providing feedback and asking for corrected results. We design a series of prompts to inject the visual model information into ChatGPT, considering models with multiple inputs/outputs and models that require visual feedback. Experiments show that Visual ChatGPT opens the door to investigating the visual roles of ChatGPT with the help of Visual Foundation Models. Our system is publicly available at https://github.com/microsoft/visual-chatgpt.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents Visual ChatGPT, a system integrating ChatGPT with multiple Visual Foundation Models (e.g., for image understanding and generation) via prompt engineering. This enables users to send/receive images, pose complex multi-step visual queries requiring model collaboration, and provide feedback for iterative corrections. The work is demonstrated through qualitative interaction examples and released as open-source code.

Significance. If the system performs as described, it provides a practical demonstration of extending conversational language models to visual domains through orchestration of existing foundation models. This could accelerate research on multimodal interfaces. The public code release supports reproducibility and further experimentation, which is a clear strength for a system paper.

major comments (1)
  1. [Experiments] Claims of effective multi-step collaboration and reliable model orchestration rest entirely on qualitative examples. No quantitative metrics, ablation studies on prompt variants, error rates for task decomposition, or failure analysis are provided, which limits assessment of the prompt-injection approach's robustness.
minor comments (1)
  1. [Method] The description of input/output typing and feedback loops in the prompt design could be expanded with a concrete example or pseudocode to improve reproducibility.
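
As an editorial illustration of the kind of concrete example this comment asks for, a feedback loop over typed tool calls might look like the sketch below. It is an assumption about the design, not the paper's released code:

```python
# Illustrative only: typed edit/critique callables with a visual
# feedback loop, where each correction round feeds the previous
# output image back in as the next input.
from typing import Callable

def edit_with_feedback(image_path: str, instruction: str,
                       edit_tool: Callable[[str, str], str],
                       critique: Callable[[str, str], str],
                       max_rounds: int = 3) -> str:
    """edit_tool: (image_path, instruction) -> image_path.
    critique: (image_path, instruction) -> 'ok' or a correction request."""
    current = image_path
    for _ in range(max_rounds):
        current = edit_tool(current, instruction)   # image in, image out
        feedback = critique(current, instruction)   # image in, text out
        if feedback.strip().lower() == "ok":
            break                                   # result accepted
        instruction = feedback                      # correction drives the next round
    return current
```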

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive summary of our work and the recommendation for minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: [Experiments] Claims of effective multi-step collaboration and reliable model orchestration rest entirely on qualitative examples. No quantitative metrics, ablation studies on prompt variants, error rates for task decomposition, or failure analysis are provided, which limits assessment of the prompt-injection approach's robustness.

    Authors: We agree that the current manuscript presents its results exclusively through qualitative interaction examples. Visual ChatGPT is positioned as a system demonstration paper whose primary contribution is the prompt-engineering framework that enables ChatGPT to orchestrate multiple visual foundation models for multi-turn, multi-modal tasks. Because the tasks are open-ended and conversational, standard quantitative metrics (e.g., task-completion accuracy or error rates) are difficult to define without introducing arbitrary task taxonomies that would themselves require extensive validation. We therefore did not include ablations or numerical benchmarks in the original submission. In the revised version we will add a dedicated subsection under Experiments that (1) enumerates representative failure modes we observed during development, (2) describes the prompt-design heuristics we adopted and why certain variants were rejected, and (3) provides a qualitative robustness analysis based on the released code. This addition will make the limitations of the prompt-injection approach more transparent while preserving the paper’s system-oriented scope.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an engineering system for integrating existing visual foundation models with ChatGPT through prompt injection and multi-step orchestration. No mathematical derivations, fitted parameters, or predictions appear in the text. The core contribution is the prompt design and architecture, supported by qualitative examples and public code release rather than any self-referential theorem, uniqueness claim, or input-to-output reduction. All components rely on external pre-trained models without load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that ChatGPT can reliably interpret injected visual model descriptions and coordinate their use for complex tasks.

axioms (1)
  • domain assumption: ChatGPT can effectively follow complex instructions and coordinate tool use via prompts
    The system relies on the LLM's ability to parse tasks and select appropriate visual models based on injected information.
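
One lightweight guard against this assumption, sketched below as an editorial illustration rather than anything from the paper, is to validate the model's chosen tool name and argument type against the registered descriptions before executing anything:

```python
# Hypothetical dispatch guard: reject tool selections that do not
# match the registered interface instead of executing a wrong or
# unknown tool. Registry contents are invented examples.
REGISTRY = {
    "Image Captioning": "image_path",
    "Text-to-Image Generation": "text",
}

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg")

def validated_dispatch(tool_name: str, argument: str) -> tuple[str, str]:
    if tool_name not in REGISTRY:
        raise ValueError(f"unknown tool selected: {tool_name!r}")
    expected = REGISTRY[tool_name]
    if expected == "image_path" and not argument.lower().endswith(IMAGE_SUFFIXES):
        raise ValueError(f"{tool_name} expects an image path, got {argument!r}")
    return tool_name, argument  # safe to hand off to the real tool
```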

pith-pipeline@v0.9.0 · 5528 in / 1120 out tokens · 30708 ms · 2026-05-13T22:46:45.409554+00:00 · methodology


Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Cross-Modal Backdoors in Multimodal Large Language Models

    cs.CR 2026-05 unverdicted novelty 8.0

    Poisoning a single connector in MLLMs establishes a reusable latent backdoor pathway that transfers across modalities with over 95% attack success rate under bounded perturbations.

  2. Probing Visual Planning in Image Editing Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.

  3. AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

  4. CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

    cs.CV 2026-04 unverdicted novelty 7.0

    CAMEO uses coordinated agents for planning, prompting, generation, and quality feedback to achieve higher structural reliability in conditional image editing than single-step models.

  5. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  6. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    cs.CV 2023-10 accept novelty 7.0

    Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.

  7. VideoChat: Chat-Centric Video Understanding

    cs.CV 2023-05 conditional novelty 7.0

    VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

  8. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  9. Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.

  10. RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    RaTA-Tool retrieves suitable external tools for multimodal queries by matching generated task descriptions against tool metadata, supported by a new Hugging Face-derived dataset and DPO optimization.

  11. ToolOmni: Enabling Open-World Tool Use via Agentic learning with Proactive Retrieval and Grounded Execution

    cs.CL 2026-04 unverdicted novelty 6.0

    ToolOmni combines supervised fine-tuning on a cold-start multi-turn dataset with Decoupled Multi-Objective GRPO to enable proactive retrieval and grounded execution, yielding +10.8% higher end-to-end tool-use success ...

  12. Towards Long-horizon Agentic Multimodal Search

    cs.CV 2026-04 unverdicted novelty 6.0

    LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...

  13. Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.

  14. Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    cs.CV 2024-01 unverdicted novelty 6.0

    Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.

  15. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  16. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    cs.CV 2023-04 conditional novelty 6.0

    MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...

  17. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    cs.CV 2023-03 unverdicted novelty 6.0

    MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

  18. Scaling Video Understanding via Compact Latent Multi-Agent Collaboration

    cs.CV 2026-05 unverdicted novelty 5.0

    MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.

  19. MIRAGE: A Micro-Interaction Relational Architecture for Grounded Exploration in Multi-Figure Artworks

    cs.CV 2026-04 unverdicted novelty 5.0

    MIRAGE improves VLM analysis of multi-figure art by inserting a verifiable structured representation of micro-interactions between spatial grounding and narrative output.

  20. Self-Reasoning Agentic Framework for Narrative Product Grid-Collage Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    A self-reasoning agentic framework constructs a Product Narrative Framework, generates constraint-aware unified grid collages, and refines outputs via failure attribution to improve narrative coherence and aesthetics ...

  21. Less Detail, Better Answers: Degradation-Driven Prompting for VQA

    cs.CV 2026-04 unverdicted novelty 5.0

    Degradation-Driven Prompting improves VQA by intentionally reducing image detail and using masks, lines, and examples to guide models toward essential structures.

  22. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  23. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    cs.CV 2023-04 conditional novelty 5.0

    LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.

  24. UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 4.0

    UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.

  25. Understanding the planning of LLM agents: A survey

    cs.AI 2024-02 accept novelty 4.0

    A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.

  26. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 26 Pith papers · 6 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022

  2. [2]

    Vqa: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015

  3. [3]

    Vlmo: Unified vision-language pre-training with mixture-of-modality-experts

    Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, and Furu Wei. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358, 2021

  4. [4]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022

  5. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  6. [6]

    Realtime multi-person 2d pose estimation using part affinity fields

    Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299, 2017

  7. [7]

    LangChain

    Harrison Chase. LangChain, October 2022

  8. [8]

    Visualgpt: Data-efficient adaptation of pretrained language models for image captioning

    Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. Visualgpt: Data-efficient adaptation of pretrained language models for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18030–18040, 2022

  9. [9]

    Uniter: Universal image-text representation learning

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX, pages 104–120. Springer, 2020

  10. [10]

    Per-pixel classification is not all you need for semantic segmentation

    Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021

  11. [11]

    Commonsense reasoning and commonsense knowledge in artificial intelligence

    Ernest Davis and Gary Marcus. Commonsense reasoning and commonsense knowledge in artificial intelligence. Communications of the ACM, 58(9):92–103, 2015

  12. [12]

    Magma – multimodal augmentation of generative models through adapter-based finetuning

    Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, and Anette Frank. Magma – multimodal augmentation of generative models through adapter-based finetuning. arXiv preprint arXiv:2112.05253, 2021

  13. [13]

    Violet: End-to-end video-language transformers with masked visual-token modeling

    Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681, 2021

  14. [14]

    Large-scale adversarial training for vision-and-language representation learning

    Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. Large-scale adversarial training for vision-and-language representation learning. Advances in Neural Information Processing Systems, 33:6616–6628, 2020

  15. [15]

    Semantic compositional networks for visual captioning

    Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic compositional networks for visual captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5630–5639, 2017

  16. [16]

    Towards light-weight and real-time line segment detection

    Geonmo Gu, Byungsoo Ko, SeoungHyun Go, Sung-Hyun Lee, Jingeun Lee, and Minchul Shin. Towards light-weight and real-time line segment detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 726–734, 2022

  17. [17]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019

  18. [18]

    Image-to-image translation with conditional adversarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017

  19. [19]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019

  20. [20]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, 2022

  21. [21]

    Manigan: Text-guided image manipulation

    Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip HS Torr. Manigan: Text-guided image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7880–7889, 2020

  22. [22]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  23. [23]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022

  24. [24]

    Uniformer: Unifying convolution and self-attention for visual recognition

    Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. Uniformer: Unifying convolution and self-attention for visual recognition. arXiv preprint arXiv:2201.09450, 2022

  25. [25]

    Oscar: Object-semantics aligned pre-training for vision-language tasks

    Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 121–137. Springer, 2020

  26. [26]

    Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm

    Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. In International Conference on Learning Representations

  27. [27]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014

  28. [28]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Advances in Neural Information Processing Systems, 2023

  29. [29]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022

  30. [30]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  31. [31]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners

  32. [32]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020

  33. [33]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021

  34. [34]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020

  35. [35]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022

  36. [36]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022

  37. [37]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020

  38. [38]

    Multimodal few-shot learning with frozen language models

    Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212, 2021

  39. [39]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  40. [40]

    Show and tell: A neural image caption generator

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015

  41. [41]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  42. [42]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022

  43. [43]

    Transformers: State-of-the-art natural language processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing

  44. [44]

    Holistically-nested edge detection

    Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015

  45. [45]

    Canny edge detection based on open cv

    Zhao Xu, Xu Baojie, and Wu Guoxin. Canny edge detection based on open cv. In 2017 13th IEEE international conference on electronic measurement & instruments (ICEMI), pages 53–56. IEEE, 2017

  46. [46]

    An empirical study of gpt-3 for few-shot knowledge-based vqa

    Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3081–3089, 2022

  47. [47]

    Star: Bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems

  48. [48]

    From recognition to cognition: Visual commonsense reasoning

    Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6720–6731, 2019

  49. [49]

    Merlot reserve: Neural script knowledge through vision and language and sound

    Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16375–16387, 2022

  50. [50]

    Socratic models: Composing zero-shot multimodal reasoning with language

    Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022

  51. [51]

    Scaling vision transformers

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022

  52. [52]

    Lit: Zero-shot transfer with locked-image text tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18123–18133, 2022

  53. [53]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023

  54. [54]

    Vinvl: Making visual representations matter in vision-language models

    Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Making visual representations matter in vision-language models. arXiv preprint arXiv:2101.00529, 1(6):8, 2021

  55. [55]

    Text as neural operator: Image manipulation by text instruction

    Tianhao Zhang, Hung-Yu Tseng, Lu Jiang, Weilong Yang, Honglak Lee, and Irfan Essa. Text as neural operator: Image manipulation by text instruction. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1893–1902, 2021

  56. [56]

    Automatic chain of thought prompting in large language models

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022

  57. [57]

    Multimodal Chain-of-Thought Reasoning in Language Models

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023

  58. [58]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022