pith. machine review for the scientific record.

arxiv: 2303.04671 · v1 · submitted 2023-03-08 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 22:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords Visual ChatGPT · multimodal interaction · prompt engineering · visual foundation models · image editing · ChatGPT · multi-step reasoning

The pith

Visual ChatGPT lets users chat with images by linking ChatGPT to visual foundation models through prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a system that adds image sending, receiving, and editing to ChatGPT conversations. It does this by writing prompts that describe the inputs, outputs, and capabilities of various visual foundation models so ChatGPT can choose and combine them for multi-step tasks. Users can also give feedback to request corrections on the visual results. A sympathetic reader would care because this turns a text-only model into one that handles complex visual instructions without retraining either ChatGPT or the visual models. The work shows how prompt design can create reliable collaboration between language and vision systems.

Core claim

By designing prompts that inject information about multiple visual foundation models into ChatGPT, the system enables users to send images, pose complex visual questions, issue multi-step editing instructions, and receive corrected results through iterative feedback, all while ChatGPT orchestrates the collaboration of the visual models.

What carries the argument

A series of prompts that describe each visual model's inputs, outputs, and feedback requirements so ChatGPT can select and sequence them for multi-step visual tasks.
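
To make the mechanism concrete, here is a minimal sketch, not the authors' code, of how such capability descriptions could be rendered into a system prompt. The tool names and fields are illustrative assumptions, not identifiers from the released repository:

```python
# Minimal sketch of prompt injection for visual tool use; all tool
# names and fields here are illustrative, not from visual-chatgpt.
from dataclasses import dataclass

@dataclass
class VisualTool:
    name: str     # e.g. "Image Captioning"
    inputs: str   # what the tool consumes
    outputs: str  # what the tool returns
    usage: str    # when the LLM should pick it

TOOLS = [
    VisualTool("Image Captioning", "an image_path",
               "a text description of the image",
               "useful when you need to know what is in an image"),
    VisualTool("Text-to-Image Generation", "a text prompt",
               "an image_path to a new image",
               "useful when asked to create an image from a description"),
    VisualTool("Instructed Image Editing", "an image_path and an edit instruction",
               "an image_path to the edited image",
               "useful when asked to change an existing image"),
]

def render_tool_prompt(tools):
    """Inject each tool's typed interface into the system prompt so the
    LLM can select and sequence tools for multi-step requests."""
    lines = ["You can use the following visual tools:"]
    for t in tools:
        lines.append(f"- {t.name}: takes {t.inputs}; returns {t.outputs}. "
                     f"It is {t.usage}.")
    lines.append("For multi-step requests, call tools one at a time and "
                 "pass each returned image_path to the next tool.")
    return "\n".join(lines)

print(render_tool_prompt(TOOLS))
```

The paper's reference list includes LangChain, which provides this kind of tool-description-to-agent wiring in practice; the sketch only shows the shape of the injected information.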

If this is right

  • Users can iterate on images by sending feedback and receiving revised outputs in the same conversation.
  • Complex visual questions become solvable by breaking them into steps that use different models for generation, detection, or editing.
  • The same prompt approach can be applied to other visual models as they are developed.
  • Conversations with ChatGPT can return images produced by the visual models instead of only text descriptions of them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could generalize to audio or video models by writing similar capability descriptions.
  • It suggests prompt engineering might serve as a lightweight way to add new modalities to existing language models.
  • Future systems might combine this with user-provided examples to further reduce selection mistakes.

Load-bearing premise

ChatGPT will correctly decompose visual requests and pick the right sequence of models from the prompt descriptions without frequent errors.

What would settle it

A fixed set of ten multi-step visual editing instructions: if the system repeatedly chooses the wrong model or fails to chain steps correctly on them, the prompt injection does not produce reliable collaboration.
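
As a hedged illustration of how that test could be run, the harness below scores exact-match tool chains over a fixed case set. `run_orchestrator` is an assumed stand-in for a Visual ChatGPT-style system, and the two cases shown are invented examples:

```python
# Hypothetical harness for the settling test above. `run_orchestrator`
# is assumed to return the ordered list of tool names the system chose.
from typing import Callable, List, Tuple

CASES: List[Tuple[str, List[str]]] = [
    ("replace the dog with a cat, then describe the result",
     ["Instructed Image Editing", "Image Captioning"]),
    ("generate a red bicycle and detect its edges",
     ["Text-to-Image Generation", "Edge Detection"]),
    # ... eight more multi-step instructions to reach the set of ten
]

def chain_accuracy(run_orchestrator: Callable[[str], List[str]]) -> float:
    """Fraction of cases whose chosen tool chain matches the expected one."""
    hits = 0
    for instruction, expected in CASES:
        chosen = run_orchestrator(instruction)
        if chosen == expected:
            hits += 1
        else:
            print(f"MISS: {instruction!r}: expected {expected}, got {chosen}")
    return hits / len(CASES)
```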

Original abstract

ChatGPT is attracting cross-field interest as it provides a language interface with remarkable conversational competency and reasoning capabilities across many domains. However, since ChatGPT is trained on language, it is currently not capable of processing or generating images from the visual world. At the same time, Visual Foundation Models, such as Visual Transformers or Stable Diffusion, show great visual understanding and generation capabilities but are only experts on specific tasks with one-round fixed inputs and outputs. To this end, we build a system called Visual ChatGPT, incorporating different Visual Foundation Models, to enable the user to interact with ChatGPT by 1) sending and receiving not only language but also images, 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models over multiple steps, and 3) providing feedback and asking for corrected results. We design a series of prompts to inject the visual model information into ChatGPT, considering models with multiple inputs/outputs and models that require visual feedback. Experiments show that Visual ChatGPT opens the door to investigating the visual roles of ChatGPT with the help of Visual Foundation Models. Our system is publicly available at https://github.com/microsoft/visual-chatgpt.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents Visual ChatGPT, a system integrating ChatGPT with multiple Visual Foundation Models (e.g., for image understanding and generation) via prompt engineering. This enables users to send/receive images, pose complex multi-step visual queries requiring model collaboration, and provide feedback for iterative corrections. The work is demonstrated through qualitative interaction examples and released as open-source code.

Significance. If the system performs as described, it provides a practical demonstration of extending conversational language models to visual domains through orchestration of existing foundation models. This could accelerate research on multimodal interfaces. The public code release supports reproducibility and further experimentation, which is a clear strength for a system paper.

major comments (1)
  1. [Experiments] Claims of effective multi-step collaboration and reliable model orchestration rest entirely on qualitative examples. No quantitative metrics, ablation studies on prompt variants, error rates for task decomposition, or failure analysis are provided, which limits assessment of the prompt-injection approach's robustness.
minor comments (1)
  1. [Method] The description of input/output typing and feedback loops in the prompt design could be expanded with a concrete example or pseudocode to improve reproducibility.
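
As an editorial illustration of the kind of concrete example this comment asks for, a feedback loop over typed tool calls might look like the sketch below. It is an assumption about the design, not the paper's released code:

```python
# Illustrative only: typed edit/critique callables with a visual
# feedback loop, where each correction round feeds the previous
# output image back in as the next input.
from typing import Callable

def edit_with_feedback(image_path: str, instruction: str,
                       edit_tool: Callable[[str, str], str],
                       critique: Callable[[str, str], str],
                       max_rounds: int = 3) -> str:
    """edit_tool: (image_path, instruction) -> image_path.
    critique: (image_path, instruction) -> 'ok' or a correction request."""
    current = image_path
    for _ in range(max_rounds):
        current = edit_tool(current, instruction)   # image in, image out
        feedback = critique(current, instruction)   # image in, text out
        if feedback.strip().lower() == "ok":
            break                                   # result accepted
        instruction = feedback                      # correction drives the next round
    return current
```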

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive summary of our work and the recommendation for minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: [Experiments] Claims of effective multi-step collaboration and reliable model orchestration rest entirely on qualitative examples. No quantitative metrics, ablation studies on prompt variants, error rates for task decomposition, or failure analysis are provided, which limits assessment of the prompt-injection approach's robustness.

    Authors: We agree that the current manuscript presents its results exclusively through qualitative interaction examples. Visual ChatGPT is positioned as a system demonstration paper whose primary contribution is the prompt-engineering framework that enables ChatGPT to orchestrate multiple visual foundation models for multi-turn, multi-modal tasks. Because the tasks are open-ended and conversational, standard quantitative metrics (e.g., task-completion accuracy or error rates) are difficult to define without introducing arbitrary task taxonomies that would themselves require extensive validation. We therefore did not include ablations or numerical benchmarks in the original submission. In the revised version we will add a dedicated subsection under Experiments that (1) enumerates representative failure modes we observed during development, (2) describes the prompt-design heuristics we adopted and why certain variants were rejected, and (3) provides a qualitative robustness analysis based on the released code. This addition will make the limitations of the prompt-injection approach more transparent while preserving the paper’s system-oriented scope.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an engineering system for integrating existing visual foundation models with ChatGPT through prompt injection and multi-step orchestration. No mathematical derivations, fitted parameters, or predictions appear in the text. The core contribution is the prompt design and architecture, supported by qualitative examples and public code release rather than any self-referential theorem, uniqueness claim, or input-to-output reduction. All components rely on external pre-trained models without load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that ChatGPT can reliably interpret injected visual model descriptions and coordinate their use for complex tasks.

axioms (1)
  • domain assumption: ChatGPT can effectively follow complex instructions and coordinate tool use via prompts
    The system relies on the LLM's ability to parse tasks and select appropriate visual models based on injected information.
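
One lightweight guard against this assumption, sketched below as an editorial illustration rather than anything from the paper, is to validate the model's chosen tool name and argument type against the registered descriptions before executing anything:

```python
# Hypothetical dispatch guard: reject tool selections that do not
# match the registered interface instead of executing a wrong or
# unknown tool. Registry contents are invented examples.
REGISTRY = {
    "Image Captioning": "image_path",
    "Text-to-Image Generation": "text",
}

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg")

def validated_dispatch(tool_name: str, argument: str) -> tuple[str, str]:
    if tool_name not in REGISTRY:
        raise ValueError(f"unknown tool selected: {tool_name!r}")
    expected = REGISTRY[tool_name]
    if expected == "image_path" and not argument.lower().endswith(IMAGE_SUFFIXES):
        raise ValueError(f"{tool_name} expects an image path, got {argument!r}")
    return tool_name, argument  # safe to hand off to the real tool
```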

pith-pipeline@v0.9.0 · 5528 in / 1120 out tokens · 30708 ms · 2026-05-13T22:46:45.409554+00:00 · methodology


Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Cross-Modal Backdoors in Multimodal Large Language Models

    cs.CR 2026-05 unverdicted novelty 8.0

    Poisoning a single connector in MLLMs establishes a reusable latent backdoor pathway that transfers across modalities with over 95% attack success rate under bounded perturbations.

  2. Probing Visual Planning in Image Editing Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.

  3. AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

  4. CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

    cs.CV 2026-04 unverdicted novelty 7.0

    CAMEO uses coordinated agents for planning, prompting, generation, and quality feedback to achieve higher structural reliability in conditional image editing than single-step models.

  5. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  6. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    cs.CV 2023-10 accept novelty 7.0

    Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.

  7. VideoChat: Chat-Centric Video Understanding

    cs.CV 2023-05 conditional novelty 7.0

    VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

  8. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  9. Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.

  10. RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    RaTA-Tool retrieves suitable external tools for multimodal queries by matching generated task descriptions against tool metadata, supported by a new Hugging Face-derived dataset and DPO optimization.

  11. ToolOmni: Enabling Open-World Tool Use via Agentic learning with Proactive Retrieval and Grounded Execution

    cs.CL 2026-04 unverdicted novelty 6.0

    ToolOmni combines supervised fine-tuning on a cold-start multi-turn dataset with Decoupled Multi-Objective GRPO to enable proactive retrieval and grounded execution, yielding +10.8% higher end-to-end tool-use success ...

  12. Towards Long-horizon Agentic Multimodal Search

    cs.CV 2026-04 unverdicted novelty 6.0

    LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...

  13. Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.

  14. Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    cs.CV 2024-01 unverdicted novelty 6.0

    Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.

  15. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  16. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    cs.CV 2023-04 conditional novelty 6.0

    MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...

  17. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    cs.CV 2023-03 unverdicted novelty 6.0

    MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.

  18. Scaling Video Understanding via Compact Latent Multi-Agent Collaboration

    cs.CV 2026-05 unverdicted novelty 5.0

    MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.

  19. MIRAGE: A Micro-Interaction Relational Architecture for Grounded Exploration in Multi-Figure Artworks

    cs.CV 2026-04 unverdicted novelty 5.0

    MIRAGE improves VLM analysis of multi-figure art by inserting a verifiable structured representation of micro-interactions between spatial grounding and narrative output.

  20. Self-Reasoning Agentic Framework for Narrative Product Grid-Collage Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    A self-reasoning agentic framework constructs a Product Narrative Framework, generates constraint-aware unified grid collages, and refines outputs via failure attribution to improve narrative coherence and aesthetics ...

  21. Less Detail, Better Answers: Degradation-Driven Prompting for VQA

    cs.CV 2026-04 unverdicted novelty 5.0

    Degradation-Driven Prompting improves VQA by intentionally reducing image detail and using masks, lines, and examples to guide models toward essential structures.

  22. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  23. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    cs.CV 2023-04 conditional novelty 5.0

    LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.

  24. UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 4.0

    UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.

  25. Understanding the planning of LLM agents: A survey

    cs.AI 2024-02 accept novelty 4.0

    A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.

  26. A Survey of Large Language Models

    cs.CL 2023-03 accept novelty 3.0

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 26 Pith papers · 6 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022

  2. [2]

    Vqa: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015

  3. [3]

    Vlmo: Unified vision-language pre-training with mixture-of-modality-experts

    Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, and Furu Wei. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358, 2021

  4. [4]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022

  5. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  6. [6]

    Realtime multi-person 2d pose estimation using part affinity fields

    Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299, 2017

  7. [7]

    LangChain

    Harrison Chase. LangChain, October 2022

  8. [8]

    Visualgpt: Data-efficient adaptation of pretrained language models for image captioning

    Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. Visualgpt: Data-efficient adaptation of pretrained language models for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18030–18040, 2022

  9. [9]

    Uniter: Universal image-text representation learning

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX, pages 104–120. Springer, 2020

  10. [10]

    Per-pixel classification is not all you need for semantic segmentation

    Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021

  11. [11]

    Commonsense reasoning and commonsense knowledge in artificial intelligence

    Ernest Davis and Gary Marcus. Commonsense reasoning and commonsense knowledge in artificial intelligence. Communications of the ACM, 58(9):92–103, 2015

  12. [12]

    Magma – multimodal augmentation of generative models through adapter-based finetuning

    Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, and Anette Frank. Magma – multimodal augmentation of generative models through adapter-based finetuning. arXiv preprint arXiv:2112.05253, 2021

  13. [13]

    Violet: End-to-end video-language transformers with masked visual-token modeling

    Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681, 2021

  14. [14]

    Large-scale adversarial training for vision-and-language representation learning

    Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. Large-scale adversarial training for vision-and-language representation learning. Advances in Neural Information Processing Systems, 33:6616–6628, 2020

  15. [15]

    Semantic compositional networks for visual captioning

    Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic compositional networks for visual captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5630–5639, 2017

  16. [16]

    Towards light-weight and real-time line segment detection

    Geonmo Gu, Byungsoo Ko, SeoungHyun Go, Sung-Hyun Lee, Jingeun Lee, and Minchul Shin. Towards light-weight and real-time line segment detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 726–734, 2022

  17. [17]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019

  18. [18]

    Image-to-image translation with conditional adversarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017

  19. [19]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019

  20. [20]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, 2022

  21. [21]

    Manigan: Text-guided image manipulation

    Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip HS Torr. Manigan: Text-guided image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7880–7889, 2020

  22. [22]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  23. [23]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022

  24. [24]

    Uniformer: Unifying convolution and self-attention for visual recognition

    Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. Uniformer: Unifying convolution and self-attention for visual recognition. arXiv preprint arXiv:2201.09450, 2022

  25. [25]

    Oscar: Object-semantics aligned pre-training for vision-language tasks

    Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 121–137. Springer, 2020

  26. [26]

    Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm

    Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. In International Conference on Learning Representations

  27. [27]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014

  28. [28]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Advances in Neural Information Processing Systems, 2023

  29. [29]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022

  30. [30]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  31. [31]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners

  32. [32]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020

  33. [33]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021

  34. [34]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020

  35. [35]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022

  36. [36]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022

  37. [37]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020

  38. [38]

    Multimodal few-shot learning with frozen language models

    Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212, 2021

  39. [39]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  40. [40]

    Show and tell: A neural image caption generator

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015

  41. [41]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  42. [42]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022

  43. [43]

    Transformers: State-of-the-art natural language processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing

  44. [44]

    Holistically-nested edge detection

    Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015

  45. [45]

    Canny edge detection based on open cv

    Zhao Xu, Xu Baojie, and Wu Guoxin. Canny edge detection based on open cv. In 2017 13th IEEE international conference on electronic measurement & instruments (ICEMI), pages 53–56. IEEE, 2017

  46. [46]

    An empirical study of gpt-3 for few-shot knowledge-based vqa

    Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3081–3089, 2022

  47. [47]

    Star: Bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems

  48. [48]

    From recognition to cognition: Visual commonsense reasoning

    Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6720–6731, 2019

  49. [49]

    Merlot reserve: Neural script knowledge through vision and language and sound

    Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16375–16387, 2022

  50. [50]

    Socratic models: Composing zero-shot multimodal reasoning with language

    Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022

  51. [51]

    Scaling vision transformers

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022

  52. [52]

    Lit: Zero-shot transfer with locked-image text tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18123–18133, 2022

  53. [53]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023

  54. [54]

    Vinvl: Making visual representations matter in vision-language models

    Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Making visual representations matter in vision-language models. arXiv preprint arXiv:2101.00529, 1(6):8, 2021

  55. [55]

    Text as neural operator: Image manipulation by text instruction

    Tianhao Zhang, Hung-Yu Tseng, Lu Jiang, Weilong Yang, Honglak Lee, and Irfan Essa. Text as neural operator: Image manipulation by text instruction. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1893–1902, 2021

  56. [56]

    Automatic chain of thought prompting in large language models

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022

  57. [57]

    Multimodal Chain-of-Thought Reasoning in Language Models

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023

  58. [58]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022