Pith · machine review for the scientific record

arxiv: 2303.17580 · v4 · submitted 2023-03-30 · 💻 cs.CL · cs.AI · cs.CV · cs.LG

Recognition: 3 Lean theorem links

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 00:01 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV · cs.LG
keywords large language models · AI agents · task planning · multi-modal AI · model orchestration · Hugging Face · ChatGPT

The pith

Large language models like ChatGPT can coordinate existing AI models to solve sophisticated multi-modal tasks by planning subtasks and selecting models through their language descriptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how large language models can act as controllers to manage and combine numerous existing AI models for solving complex tasks. The approach uses ChatGPT to break down a user's request into steps, choose suitable models from Hugging Face based on their descriptions, run those models on the subtasks, and then compile the final answer. A reader would care because this turns the many available specialized models into a flexible system capable of handling problems across text, images, audio, and more without building everything from scratch. It suggests a practical path to more capable AI by coordinating what already exists rather than training ever-larger single models.

Core claim

Based on the philosophy that language can serve as a generic interface, HuggingGPT leverages LLMs to connect various AI models to solve complicated AI tasks. Specifically, ChatGPT conducts task planning when receiving a user request, selects models according to their function descriptions available in Hugging Face, executes each subtask with the selected AI model, and summarizes the response according to the execution results. By leveraging the strong language capability of ChatGPT and abundant AI models in Hugging Face, HuggingGPT can tackle a wide range of sophisticated AI tasks spanning different modalities and domains.
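The four-stage loop described above can be sketched as follows. Everything here is a stand-in: the catalog entries, the keyword-based planner, and the canned execution results are illustrative stubs for the ChatGPT calls and Hugging Face inference endpoints the paper actually uses.

```python
# Minimal, self-contained sketch of the HuggingGPT four-stage loop.
# Planner, selector, and executor are trivial stand-ins (keyword matching
# and canned outputs), not the paper's actual prompts or endpoints.

MODEL_CATALOG = {
    "nlpconnect/vit-gpt2-image-captioning": "image-to-text: caption an image",
    "facebook/detr-resnet-50": "object-detection: find objects in an image",
}

def plan_tasks(request):
    """Stage 1: task planning. A real system asks the LLM to decompose
    the request; this stub keys off simple keywords."""
    tasks = []
    if "describe" in request:
        tasks.append({"id": 0, "dep": [-1], "task": "image-to-text",
                      "args": {"image": "example.jpg"}})
    if "objects" in request:
        tasks.append({"id": 1, "dep": [-1], "task": "object-detection",
                      "args": {"image": "example.jpg"}})
    return tasks

def select_model(task):
    """Stage 2: model selection from textual function descriptions."""
    for name, desc in MODEL_CATALOG.items():
        if desc.startswith(task["task"]):
            return name
    raise LookupError(f"no model for {task['task']}")

def execute(model, task):
    """Stage 3: execution (stubbed; would call an inference endpoint)."""
    return {"model": model, "output": f"<{task['task']} result>"}

def respond(request, results):
    """Stage 4: response generation (stubbed LLM summary)."""
    return f"For '{request}': " + "; ".join(r["output"] for r in results)

def hugginggpt(request):
    tasks = plan_tasks(request)
    results = [execute(select_model(t), t) for t in tasks]
    return respond(request, results)
```

The design point carried by the stub is the same as the paper's: the only interface between stages is text (requests, descriptions, results), so models can be swapped without retraining anything.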

What carries the argument

The HuggingGPT agent, an LLM-powered system that performs task planning, model selection from function descriptions, subtask execution, and response summarization to connect specialized models.
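The planning stage's output format, reproduced from the paper's appendix examples, is a list of task records, each carrying an id, a dependency list (`dep`, where -1 means no prerequisite), a task type, and arguments. A minimal sketch of how that structure drives execution order:

```python
# Task-planning records in the format shown in the paper's appendix:
# id, dependency list (dep; -1 means no prerequisite), task type, args.
plan = [
    {"id": "0", "dep": [-1], "task": "image-to-text",
     "args": {"image": "example.jpg"}},
    {"id": "2", "dep": [-1], "task": "object-detection",
     "args": {"image": "example.jpg"}},
]

# Tasks with dep == [-1] can run immediately; any other task must wait
# for the listed ids, which induces a simple execution ordering.
ready = [t for t in plan if t["dep"] == [-1]]
```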

If this is right

  • AI tasks spanning multiple modalities can be solved by coordinating existing models rather than creating new integrated systems.
  • Impressive performance is achieved on language, vision, speech, and other challenging tasks.
  • This method opens a new direction toward artificial general intelligence by using LLMs as managers of specialized tools.
  • Users can tackle sophisticated problems through natural language requests without needing expertise in each domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This orchestration method could extend to other model repositories, creating a more interconnected AI ecosystem.
  • Potential limitations in planning accuracy might be addressed by allowing the system to test multiple model options in parallel.
  • Similar controllers could be built for other ecosystems of AI tools, broadening the approach beyond one platform.

Load-bearing premise

ChatGPT can reliably perform task planning and select appropriate models from their function descriptions without frequent errors that break the overall solution.

What would settle it

Running the system on benchmark tasks with known correct model sequences and measuring the frequency of incorrect model selections or failed task completions.
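A minimal sketch of that settling experiment, assuming a hypothetical `plan_and_select` hook into the system under test and a benchmark of queries paired with known-correct model sequences:

```python
# Sketch of the settling experiment: run the planner/selector on benchmark
# queries and count how often the predicted model sequence is wrong.
# `plan_and_select` and the benchmark format are illustrative assumptions.

def selection_error_rate(benchmark, plan_and_select):
    """benchmark: list of (query, expected model sequence) pairs.
    Returns the fraction of queries with an incorrect selection."""
    errors = 0
    for query, expected in benchmark:
        predicted = plan_and_select(query)
        if predicted != expected:
            errors += 1
    return errors / len(benchmark)
```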

read the original abstract

Solving complicated AI tasks with different domains and modalities is a key step toward artificial general intelligence. While there are numerous AI models available for various domains and modalities, they cannot handle complicated AI tasks autonomously. Considering large language models (LLMs) have exhibited exceptional abilities in language understanding, generation, interaction, and reasoning, we advocate that LLMs could act as a controller to manage existing AI models to solve complicated AI tasks, with language serving as a generic interface to empower this. Based on this philosophy, we present HuggingGPT, an LLM-powered agent that leverages LLMs (e.g., ChatGPT) to connect various AI models in machine learning communities (e.g., Hugging Face) to solve AI tasks. Specifically, we use ChatGPT to conduct task planning when receiving a user request, select models according to their function descriptions available in Hugging Face, execute each subtask with the selected AI model, and summarize the response according to the execution results. By leveraging the strong language capability of ChatGPT and abundant AI models in Hugging Face, HuggingGPT can tackle a wide range of sophisticated AI tasks spanning different modalities and domains and achieve impressive results in language, vision, speech, and other challenging tasks, which paves a new way towards the realization of artificial general intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents HuggingGPT, an LLM-powered agent that uses ChatGPT to conduct task planning on user requests, select models from Hugging Face according to their textual function descriptions, execute each subtask with the chosen model, and summarize the final response. It claims this architecture enables solving a wide range of sophisticated multi-modal AI tasks across language, vision, speech, and other domains, achieving impressive results and advancing toward AGI.

Significance. If the performance claims are substantiated, the work is significant for demonstrating a practical orchestration framework that treats language as a generic interface between LLMs and the existing ecosystem of specialized models on Hugging Face. This approach could reduce the need for end-to-end retraining and enable rapid composition of capabilities, with potential for reproducible agent systems if code and prompts are released.

major comments (2)
  1. [Results] Results section: the central claim of 'impressive results' across modalities rests on qualitative examples only; no quantitative metrics, baselines, error rates, or benchmark suites are reported, leaving the breadth and reliability of the system unverified.
  2. [Section 3] Task planning and model selection (Section 3): the assumption that ChatGPT reliably decomposes requests and maps them to correct HF models via descriptions is load-bearing for end-to-end success, yet no accuracy measurements, failure-mode analysis, or controlled tests of planning errors are provided.
minor comments (2)
  1. [Abstract] The abstract and introduction repeat the phrase 'impressive results' without defining success criteria or providing even summary statistics.
  2. [Figure 1] Figure 1 (system overview): the four-stage pipeline diagram would be clearer with explicit arrows and labels for the planning-selection-execution-summarization loop.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate revisions to strengthen the evaluation and analysis sections.

read point-by-point responses
  1. Referee: [Results] Results section: the central claim of 'impressive results' across modalities rests on qualitative examples only; no quantitative metrics, baselines, error rates, or benchmark suites are reported, leaving the breadth and reliability of the system unverified.

    Authors: We agree that the current results section relies on qualitative demonstrations. The manuscript prioritizes showcasing the framework's ability to handle diverse multi-modal tasks via examples rather than exhaustive benchmarking, as standardized benchmarks for open-ended agent systems remain limited. In revision, we will add quantitative evaluation including success rates on a curated set of 100 user queries spanning modalities, failure case categorization, and discussion of why full baselines are challenging for this architecture. revision: yes

  2. Referee: [Section 3] Task planning and model selection (Section 3): the assumption that ChatGPT reliably decomposes requests and maps them to correct HF models via descriptions is load-bearing for end-to-end success, yet no accuracy measurements, failure-mode analysis, or controlled tests of planning errors are provided.

    Authors: We acknowledge the absence of isolated metrics for the planning and selection stages. The paper presents the end-to-end system with illustrative cases but does not isolate component accuracy. We will revise Section 3 to include controlled experiments: accuracy of task decomposition and model selection measured on 200 held-out queries, plus a failure-mode analysis (e.g., errors from ambiguous descriptions or over-decomposition). This will be added without altering the core claims. revision: yes
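The failure-mode analysis proposed in the rebuttal could be tallied as below; the error categories follow the rebuttal's examples, and `classify_failure` is a hypothetical labeling hook, not part of the paper's system.

```python
# Sketch of the proposed failure-mode analysis: label each failed query
# with an error category and tally the distribution. Categories follow
# the rebuttal's examples (ambiguous descriptions, over-decomposition).
from collections import Counter

CATEGORIES = ("ambiguous-description", "over-decomposition", "other")

def failure_breakdown(failed_queries, classify_failure):
    """Return per-category counts over a set of failed queries."""
    counts = Counter(classify_failure(q) for q in failed_queries)
    return {c: counts.get(c, 0) for c in CATEGORIES}
```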

Circularity Check

0 steps flagged

No circularity: system architecture description with no mathematical derivation or fitted predictions

full rationale

The paper is a system description of HuggingGPT, an LLM-powered agent that uses ChatGPT for task planning, model selection from Hugging Face descriptions, execution, and summarization. No equations, parameters, or derivations are present that could reduce by construction to inputs. Claims rest on the described pipeline and qualitative examples rather than any self-referential fits, uniqueness theorems, or self-citation chains. The central premise (LLMs as controllers) is presented as a philosophy motivating the system, not derived from prior self-citations or ansatzes. This is a standard engineering/systems paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new mathematical constants or fitted parameters. It rests on the domain assumption that current LLMs possess sufficient planning and reasoning ability to select and coordinate external models correctly.

axioms (1)
  • domain assumption Large language models such as ChatGPT exhibit strong capabilities in language understanding, generation, interaction, and reasoning sufficient for task planning and model selection.
    This assumption is invoked in the abstract to justify using ChatGPT as the controller for the entire pipeline.

pith-pipeline@v0.9.0 · 5551 in / 1333 out tokens · 54010 ms · 2026-05-14T00:01:14.249105+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

    cs.SE 2026-05 unverdicted novelty 7.0

    SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...

  2. From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework

    cs.LG 2026-05 unverdicted novelty 7.0

    AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...

  3. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  4. AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

  5. IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling

    cs.AI 2026-04 unverdicted novelty 7.0

    IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.

  6. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  7. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  8. VideoChat: Chat-Centric Video Understanding

    cs.CV 2023-05 conditional novelty 7.0

    VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

  9. Reflexion: Language Agents with Verbal Reinforcement Learning

    cs.AI 2023-03 conditional novelty 7.0

    Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.

  10. Why Does Agentic Safety Fail to Generalize Across Tasks?

    cs.LG 2026-05 conditional novelty 6.0

    Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...

  11. SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

    cs.CR 2026-02 unverdicted novelty 6.0

    The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.

  12. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    cs.CL 2024-10 unverdicted novelty 6.0

    OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

  13. Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    cs.CV 2024-01 unverdicted novelty 6.0

    Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.

  14. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  15. Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    eess.AS 2023-11 unverdicted novelty 6.0

    Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.

  16. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    cs.CV 2023-06 unverdicted novelty 6.0

    MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

  17. Gorilla: Large Language Model Connected with Massive APIs

    cs.CL 2023-05 conditional novelty 6.0

    Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.

  18. Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution

    cs.SE 2026-04 conditional novelty 5.0

    Bounded autonomy using typed action contracts and consumer-side execution lets LLMs safely operate enterprise systems, achieving 23 of 25 tasks with zero unsafe executions versus 17 for unconstrained AI across 25 trials.

  19. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  20. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    cs.CV 2023-04 conditional novelty 5.0

    LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.

  21. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  22. A Self-Healing Framework for Reliable LLM-Based Autonomous Agents

    cs.SE 2026-05 unverdicted novelty 4.0

    A framework that monitors LLM agent behavior, assesses reliability, and automatically heals failures to raise task success rates in multi-agent workflows.

  23. Understanding the planning of LLM agents: A survey

    cs.AI 2024-02 accept novelty 4.0

    A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.

  24. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

  25. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 25 Pith papers · 9 internal anchors

  1. [1]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  2. [2]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedba...

  3. [3]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, and others. Palm: Scaling language modeling with pathways. ArXiv, abs/2204.02311, 2022

  4. [4]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open Pre-trained Transformer Language Models. ArXiv, abs/2205.01068, 2022

  5. [5]

    Glm-130b: An Open Bilingual Pre-trained Model

    Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. Glm-130b: An Open Bilingual Pre-trained Model. ICLR 2023 poster, 2023

  6. [6]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and Efficient Foundation Language Models. ArXiv, abs/2302.13971, 2023

  7. [7]

    An Explanation of In-context Learning as Implicit Bayesian Inference

    Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An Explanation of In-context Learning as Implicit Bayesian Inference. ICLR 2022 Poster, 2022

  8. [8]

    Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2022

  9. [9]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022

  10. [10]

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Virendrabhai Purohit, Ishani Mondal, Jacob William Anderson, Kirby C. Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel...

  11. [11]

    S. Iyer, Xiaojuan Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, Xian Li, Brian O’Horo, Gabriel Pereyra, Jeff Wang, Christopher Dewan, Asli Celikyilmaz, Luke Zettlemoyer, and Veselin Stoyanov. OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalizati...

  12. [12]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y . Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jef...

  13. [13]

    Self-Instruct: Aligning Language Model with Self Generated Instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions, 2022

  14. [14]

    The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

    Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. The flan collection: Designing data and methods for effective instruction tuning. CoRR, abs/2301.13688, 2023

  15. [15]

    Chain of Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain of Thought Prompting Elicits Reasoning in Large Language Models. In Conference on Neural Information Processing Systems (NeurIPS), 2022

  16. [16]

    Large Language Models are Zero-Shot Reasoners

    Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large Language Models are Zero-Shot Reasoners. In Conference on Neural Information Processing Systems (NeurIPS), 2022

  17. [17]

    PAL: Program-aided Language Models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided Language Models. ArXiv, abs/2211.10435, 2022

  18. [18]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023 poster, abs/2203.11171, 2023

  19. [19]

    Emergent Abilities of Large Language Models

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. CoRR, abs/2206.07682, 2022

  20. [20]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  21. [21]

    Flamingo: a visual language model for few-shot learning, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkow...

  22. [22]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, S. Savarese, and Steven Hoi. Blip-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ArXiv, abs/2301.12597, 2023

  23. [23]

    Language is not all you need: Aligning perception with language models

    Shaohan Huang, Li Dong, Wenhui Wang, Y. Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, O. Mohammed, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and Furu Wei. Language Is Not All You Need: Aligning Perception with Language Models. ArXiv, abs/2302.14045, 2023

  24. [24]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, M. Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language Models Can Teach Themselves to Use Tools. ArXiv, abs/2302.04761, 2023

  25. [25]

    Vipergpt: Visual inference via python execution for reasoning, 2023

    Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning, 2023

  26. [26]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Chenfei Wu, Sheng-Kai Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv, 2023

  27. [27]

    Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis, 2023

    Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, Yun Wang, Linjun Shou, Ming Gong, and Nan Duan. Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis, 2023

  28. [28]

    Tool learning with foundation models, 2023

    Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi, Yuzhang Zhu, Zhenning Dai, Lan Yan, Xin Cong, Yaxi Lu, Weilin Zhao, Yux...

  29. [29]

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. Blip: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning (ICML), pages 12888–12900, 2022

  30. [30]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. ArXiv, abs/2302.05543, 2023

  31. [31]

    Visual Programming: Compositional visual reasoning without training

    Tanmay Gupta and Aniruddha Kembhavi. Visual Programming: Compositional visual reasoning without training. arXiv, abs/2211.11559, 2022

  32. [32]

    An image is worth 16x16 words: Transformers for image recognition at scale, 2021

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021

  33. [33]

    The illustrated image captioning using transformers

    Ankur Kumar. The illustrated image captioning using transformers. ankur3107.github.io, 2022

  34. [34]

    End-to-end object detection with transformers, 2020

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers, 2020

  35. [35]

    Computation of Normalized Edit Distance and Applications

    A. Marzal and E. Vidal. Computation of normalized edit distance and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):926–932, 1993

  36. [36]

    Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023

  37. [37]

    Stanford Alpaca: An Instruction-Following LLaMA Model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

Execution Results:

{"generated text": "a herd of giraffes and zebras grazing in a field"}
{"generated image": "images/df8t.jpg"}

Query #2: then based on the above canny image and a prompt "a photo of a zoo", generate a new image.

Task Planning #2: 1: {"task": "canny-text-to-image", "id": 0, "dep": [-1], "args": {"text": "a photo of a zoo", "image": "example.jpg"}}

Execution Results #2:

{"generated image": "images/ja2f.jpg"}

Response #2: I have received your request to generate a canny image of a photo of a zoo. I have selected the model lllyasviel/sd-controlnet-canny as it is the best model for this task. I have processed the image and...

{'generated image with predicted box': '/images/dif2.jpg', 'predicted': [{'box': {'xmax': 760, 'xmin': 515, 'ymax': 437, 'ymin': 136}, 'label': 'giraffe', 'score': 0.9992127418518066}, {'box': {'xmax': 561, 'xmin': 510, 'ymax': 447, 'ymin': 357}, 'label': 'zebra', 'score': 0.9983904361724854}, {'box': {'xmax': 324, 'xmin': 178, 'ymax': 446, 'ymin': 336}, ...

Query: Describe this image example.jpg in as much detail as possible.

Task Planning:

{"id": "0", "dep": [-1], "task": "image-to-text", "args": {"image": "example.jpg"}},
{"id": "1", "dep": [-1], "task": "image-classification", "args": {"image": "example.jpg"}},
{"id": "2", "dep": [-1], "task": "object-detection", "args": {"image": "example.jpg"}},
{"id": "3", "dep": [-1], "task": "image-segmentation", "args": {"image": "example.jpg"}},
{"id": "4", "dep": [-1], "task": "visual-question-answering", "args": {"text": "describe this image in detail", "image": "example.jpg"}}

Response: The image you provided appears to show a family of four dogs playing in the grass. For this task, I ran multiple CV tasks, including image-to-text, image-classification, object-detection, visual-question-a...

Execution Results:

{"generated_text": "a family of four dogs are playing in the grass"}
[{'score': 0.938717246055603, 'label': 'Rhodesian ridgeback'}, {'score': 0.01652258075773716, 'label': 'boxer'}, {'score': 0.006381669547408819, 'label': 'Great Dane'}, {'score': 0.006234415341168642, 'label': 'vizsla, Hungarian pointer'}, {'score': 0.005944834090769291, 'label': 'bull mastiff'}]
[{"generated image with predicted box": "/images/sd48.jpg", "predicted": [{"box": {"xmax": 463, "xmin": 373, "ymax": 267, "ymin": 199}, "label": "bus", "score": 0.9981155395507812}, {"box": {"xmax": 292, "xmin": 124, "ymax": 224, "ymin": 175}, "label": "airplane", "score": 0.9983609318733215}, {"box": {"xmax": 554, "xmin": 487, "ymax": 277, "ymin": 226}, ...
{"generated image with segmentation": "/images/t4s1.jpg", "predicted": [{"score": 0.989, "label": "grass"}, {"score": 0.999, "label": "dog"}, {"score": 0.999, "label": "tree"}, {"score": 0.999, "label": "dog"}]}
[{'answer': 'dogs', 'score': 0.8488452434539795}, {'answer': 'dog', 'score': 0.04168461635708809}]

Figure 7: Case study on complex tasks (b).

Query: Can you tell me how many objects in the picture example1.jpg?

Task Planning: 1: {"task": "object-detection", "id": 0, "dep": [-1], "args": {"image": "example1.jpg"}}, 2: {"task": "image-to-text", "id": 1, de...

Execution Results:

[{"generated image with predicted box": "/images/79f2.jpg", "predicted": [{"box": {"xmax": 463, "xmin": 373, "ymax": 267, "ymin": 199}, "label": "bus", "score": 0.9981155395507812}, {"box": {"xmax": 292, "xmin": 124, "ymax": 224, "ymin": 175}, "label": "airplane", "score": 0.9983609318733215}, {"box": {"xmax": 554, "xmin": 487, "ymax": 277, "ymin": 226}, ...
{"generated_text": "a large jetliner sitting on top of an airport tarmac"}

Query: In example2.jpg, what's the animal and what's it doing?

Task Planning: 1: {"task": "image-classification", "id": 0, "dep": [-1], "args": {"image": "example2.jpg"}}, 2: {"task": "visual-question-answering", "id": 1, "dep": [-1], "args": {"text": "What is the animal doing?", "im...

Execution Results:

[{'score': 0.8858247399330139, 'label': 'tiger, Panthera tigris'}, {'score': 0.10940514504909515, 'label': 'tiger cat'}, {'score': 0.0006216467591002584, 'label': 'jaguar, panther, Panthera onca, Felis onca'}, {'score': 0.0004262701258994639, 'label': 'dhole, Cuon alpinus'}, {'score': 0.00030842673731967807, 'label': 'lion, king of beasts, Panthera leo'}]
[{'answer': 'sitting', 'score': 0.8294351696968079}, {'answer': 'laying down', 'score': 0.7387925386428833}, {'answer': 'resting', 'score': 0.625247061252594}, {'answer': 'laying', 'score': 0.6114301681518555}, {'answer': 'relaxing', 'score': 0.44740357995033264}]
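Task plans like those above are small dependency graphs: a task with "dep": [-1] has no prerequisites and can run immediately, while a task listing other ids must wait for them. A hypothetical scheduler (not the paper's implementation) can group such a plan into parallelizable stages:

```python
# Hypothetical scheduler for HuggingGPT-style task plans: repeatedly collect
# every task whose dependencies are satisfied ("dep" of -1 means none),
# forming stages that could run in parallel.
def execution_stages(plan):
    done, stages, remaining = set(), [], list(plan)
    while remaining:
        ready = [t for t in remaining
                 if all(d == -1 or d in done for d in t["dep"])]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        stages.append(ready)
        done.update(t["id"] for t in ready)
        remaining = [t for t in remaining if t not in ready]
    return stages

# The five-subtask "describe this image" plan collapses to a single stage,
# since every task has "dep": [-1] (ids shown as ints for simplicity):
plan = [
    {"id": 0, "dep": [-1], "task": "image-to-text"},
    {"id": 1, "dep": [-1], "task": "image-classification"},
    {"id": 2, "dep": [-1], "task": "object-detection"},
    {"id": 3, "dep": [-1], "task": "image-segmentation"},
    {"id": 4, "dep": [-1], "task": "visual-question-answering"},
]
stages = execution_stages(plan)  # one stage of five independent tasks
```

By contrast, a plan where one task depends on another (as in the pose-transfer case below, "dep": [0, 1]) yields multiple stages.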

Figure 8: Case study on complex tasks (c).

Query: based on the pose image of example1.jpg and the description of example2.jpg, please generate a new one.

Task Planning:

{'task': 'image-to-text', 'id': 1, 'dep': [-1], 'args': {'image': 'example2.jpg'}}
{"task": "pose-text-to-image", "id": 2, "dep": [0, 1], "args": {"text": "<resource>-1", "image": "<resource>-0"}} (model: sd-controlnet-openpose)

Execution Results:

{'generated_text': 'a brown and white striped zebra laying on a rock'}

Response: Based on the pose image of example1.jpg and the description of example2.jpg, I have generated a new image s5f8.jpg for ...

Figure 9: Qualitative analysis of multi-model cooperation with resource dependency.

Query: generate a video with the title "An astronaut is walking in space" and dub it.
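The "<resource>-<id>" convention above lets a planned task refer to the not-yet-computed output of an earlier task. A hypothetical resolver (names and data layout are illustrative, not the paper's code) substitutes these placeholders once the prerequisite results exist:

```python
# Hypothetical resolver for "<resource>-k" placeholders: before a task runs,
# any argument of that form is replaced by the output of task k.
import re

def resolve_resources(args, results):
    """Replace "<resource>-k" values in args with results[k]."""
    resolved = {}
    for key, value in args.items():
        m = re.fullmatch(r"<resource>-(\d+)", str(value))
        resolved[key] = results[int(m.group(1))] if m else value
    return resolved

# Toy results mimicking the pose-transfer case: task 0 produced a pose
# image (path is made up), task 1 produced the caption of example2.jpg.
results = {0: "images/pose.jpg",
           1: "a brown and white striped zebra laying on a rock"}
args = {"text": "<resource>-1", "image": "<resource>-0"}
resolved = resolve_resources(args, results)
# resolved["image"] is now "images/pose.jpg"; resolved["text"] is the caption.
```

This is what makes the dependency field actionable: "dep": [0, 1] tells the executor which results must exist before substitution can happen.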

Query: read the image example.jpg for me.

Execution Results:

{"generated text": "INVOICE NO : 18301/102/T0305"}

Response: Sure. I have read the image example.jpg for you. The inference result is 'INVOICE NO : 18301/102/T0305'. The model used for this task is microsoft/trocr-base-printed, which is best suited for the task of image-to-text as it is fine-tuned on the SROIE dat...

Execution Results:

{"generated audio": "/audios/3f9e.flac"}
{"generated video": "/videos/5696.mp4"}
{"generated audio": "/audios/da5s.wav"}

Figure 10: Qualitative analysis of multi-model cooperation on video and audio modalities.

Query: Given a collection of images A: /examples/a.jpg, B: /examples/b.jpg, C: /examples/c.jpg, please tell me how many zebras in these pictures?

Response: In the collection of images A, B, and C, there are a total of 4 zebr...
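Answering the multi-image counting query above amounts to running object detection per image and aggregating the labels. A hypothetical aggregation step (the detection output format follows the execution results shown earlier; the inputs below are toy values, not the paper's data):

```python
# Hypothetical aggregation over per-image object-detection outputs in the
# {"predicted": [{"label": ..., "score": ...}, ...]} format shown above:
# count boxes carrying a given label above a confidence threshold.
from collections import Counter

def count_label(detections_per_image, label, threshold=0.9):
    """Sum detections of `label` above `threshold` across all images."""
    total = Counter()
    for detections in detections_per_image:
        for det in detections["predicted"]:
            if det["score"] >= threshold:
                total[det["label"]] += 1
    return total[label]

# Toy inputs mimicking the detection format (boxes omitted for brevity):
images = [
    {"predicted": [{"label": "zebra", "score": 0.99},
                   {"label": "zebra", "score": 0.98}]},
    {"predicted": [{"label": "zebra", "score": 0.97}]},
    {"predicted": [{"label": "zebra", "score": 0.95},
                   {"label": "giraffe", "score": 0.99}]},
]
print(count_label(images, "zebra"))  # 4
```

In HuggingGPT itself this final arithmetic is left to the LLM's response-generation stage, which reads the execution results and composes the answer in natural language.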