arxiv: 2303.17580 · v4 · submitted 2023-03-30 · 💻 cs.CL · cs.AI· cs.CV· cs.LG

Recognition: 3 theorem links

· Lean Theorem

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Yongliang Shen , Kaitao Song , Xu Tan , Dongsheng Li , Weiming Lu , Yueting Zhuang

Authors on Pith no claims yet

Pith reviewed 2026-05-14 00:01 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CVcs.LG

keywords large language modelsAI agentstask planningmulti-modal AImodel orchestrationHugging FaceChatGPT

0 comments

The pith

Large language models like ChatGPT can coordinate existing AI models to solve sophisticated multi-modal tasks by planning and selecting them via language descriptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how large language models can act as controllers to manage and combine numerous existing AI models for solving complex tasks. The approach uses ChatGPT to break down a user's request into steps, choose suitable models from Hugging Face based on their descriptions, run those models on the subtasks, and then compile the final answer. A reader would care because this turns the many available specialized models into a flexible system capable of handling problems across text, images, audio, and more without building everything from scratch. It suggests a practical path to more capable AI by coordinating what already exists rather than training ever-larger single models.

Core claim

Based on the philosophy that language can serve as a generic interface, HuggingGPT leverages LLMs to connect various AI models to solve complicated AI tasks. Specifically, ChatGPT conducts task planning when receiving a user request, selects models according to their function descriptions available in Hugging Face, executes each subtask with the selected AI model, and summarizes the response according to the execution results. By leveraging the strong language capability of ChatGPT and abundant AI models in Hugging Face, HuggingGPT can tackle a wide range of sophisticated AI tasks spanning different modalities and domains.

What carries the argument

The HuggingGPT agent, an LLM-powered system that performs task planning, model selection from function descriptions, subtask execution, and response summarization to connect specialized models.

If this is right

AI tasks spanning multiple modalities can be solved by coordinating existing models rather than creating new integrated systems.
Impressive performance is achieved on language, vision, speech, and other challenging tasks.
This method opens a new direction toward artificial general intelligence by using LLMs as managers of specialized tools.
Users can tackle sophisticated problems through natural language requests without needing expertise in each domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This orchestration method could extend to other model repositories, creating a more interconnected AI ecosystem.
Potential limitations in planning accuracy might be addressed by allowing the system to test multiple model options in parallel.
Similar controllers could be built for other ecosystems of AI tools, broadening the approach beyond one platform.

Load-bearing premise

ChatGPT can reliably perform task planning and select appropriate models from their function descriptions without frequent errors that break the overall solution.

What would settle it

Running the system on benchmark tasks with known correct model sequences and measuring the frequency of incorrect model selections or failed task completions.

read the original abstract

Solving complicated AI tasks with different domains and modalities is a key step toward artificial general intelligence. While there are numerous AI models available for various domains and modalities, they cannot handle complicated AI tasks autonomously. Considering large language models (LLMs) have exhibited exceptional abilities in language understanding, generation, interaction, and reasoning, we advocate that LLMs could act as a controller to manage existing AI models to solve complicated AI tasks, with language serving as a generic interface to empower this. Based on this philosophy, we present HuggingGPT, an LLM-powered agent that leverages LLMs (e.g., ChatGPT) to connect various AI models in machine learning communities (e.g., Hugging Face) to solve AI tasks. Specifically, we use ChatGPT to conduct task planning when receiving a user request, select models according to their function descriptions available in Hugging Face, execute each subtask with the selected AI model, and summarize the response according to the execution results. By leveraging the strong language capability of ChatGPT and abundant AI models in Hugging Face, HuggingGPT can tackle a wide range of sophisticated AI tasks spanning different modalities and domains and achieve impressive results in language, vision, speech, and other challenging tasks, which paves a new way towards the realization of artificial general intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HuggingGPT outlines a usable LLM controller for Hugging Face models, but its performance claims lack supporting measurements.

read the letter

Hey, on the HuggingGPT paper: the main point is it turns ChatGPT into a controller that plans a task, picks the right Hugging Face models by reading their descriptions, runs them one by one, and then summarizes the results. The examples look decent for handling mixed tasks like vision plus language. What is new is the specific loop they built around existing models rather than training one big system. It shows language can act as the interface to stitch together a bunch of specialized tools, and they make it work for several domains without extra fine-tuning. That part is practical and easy to follow. They do a solid job describing how the selection step uses the model cards or descriptions on Hugging Face. The qualitative demos give readers something to try themselves. Where it falls short is the lack of any quantitative backing. No success percentages, no breakdown of where it fails, and no comparison to just prompting ChatGPT directly or using other routing methods. The assumption that the LLM will pick correctly every time is central, but untested in the paper, so the impressive results claim is hard to evaluate. This paper is for people already playing with LLM agents who want a ready example of multi-model coordination. A reader could implement something similar quickly from the description. I think it deserves peer review. The architecture is clear and the problem it tackles is real, even if the current evidence is mostly illustrative. Referees would likely ask for metrics, which would make it stronger.

Referee Report

2 major / 2 minor

Summary. The paper presents HuggingGPT, an LLM-powered agent that uses ChatGPT to conduct task planning on user requests, select models from Hugging Face according to their textual function descriptions, execute each subtask with the chosen model, and summarize the final response. It claims this architecture enables solving a wide range of sophisticated multi-modal AI tasks across language, vision, speech, and other domains, achieving impressive results and advancing toward AGI.

Significance. If the performance claims are substantiated, the work is significant for demonstrating a practical orchestration framework that treats language as a generic interface between LLMs and the existing ecosystem of specialized models on Hugging Face. This approach could reduce the need for end-to-end retraining and enable rapid composition of capabilities, with potential for reproducible agent systems if code and prompts are released.

major comments (2)

[Results] Results section: the central claim of 'impressive results' across modalities rests on qualitative examples only; no quantitative metrics, baselines, error rates, or benchmark suites are reported, leaving the breadth and reliability of the system unverified.
[Section 3] Task planning and model selection (Section 3): the assumption that ChatGPT reliably decomposes requests and maps them to correct HF models via descriptions is load-bearing for end-to-end success, yet no accuracy measurements, failure-mode analysis, or controlled tests of planning errors are provided.

minor comments (2)

[Abstract] The abstract and introduction repeat the phrase 'impressive results' without defining success criteria or providing even summary statistics.
[Figure 1] Figure 1 (system overview): the four-stage pipeline diagram would be clearer with explicit arrows and labels for the planning-selection-execution-summarization loop.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate revisions to strengthen the evaluation and analysis sections.

read point-by-point responses

Referee: [Results] Results section: the central claim of 'impressive results' across modalities rests on qualitative examples only; no quantitative metrics, baselines, error rates, or benchmark suites are reported, leaving the breadth and reliability of the system unverified.

Authors: We agree that the current results section relies on qualitative demonstrations. The manuscript prioritizes showcasing the framework's ability to handle diverse multi-modal tasks via examples rather than exhaustive benchmarking, as standardized benchmarks for open-ended agent systems remain limited. In revision, we will add quantitative evaluation including success rates on a curated set of 100 user queries spanning modalities, failure case categorization, and discussion of why full baselines are challenging for this architecture. revision: yes
Referee: [Section 3] Task planning and model selection (Section 3): the assumption that ChatGPT reliably decomposes requests and maps them to correct HF models via descriptions is load-bearing for end-to-end success, yet no accuracy measurements, failure-mode analysis, or controlled tests of planning errors are provided.

Authors: We acknowledge the absence of isolated metrics for the planning and selection stages. The paper presents the end-to-end system with illustrative cases but does not isolate component accuracy. We will revise Section 3 to include controlled experiments: accuracy of task decomposition and model selection measured on 200 held-out queries, plus a failure-mode analysis (e.g., errors from ambiguous descriptions or over-decomposition). This will be added without altering the core claims. revision: yes

Circularity Check

0 steps flagged

No circularity: system architecture description with no mathematical derivation or fitted predictions

full rationale

The paper is a system description of HuggingGPT, an LLM-powered agent that uses ChatGPT for task planning, model selection from Hugging Face descriptions, execution, and summarization. No equations, parameters, or derivations are present that could reduce by construction to inputs. Claims rest on the described pipeline and qualitative examples rather than any self-referential fits, uniqueness theorems, or self-citation chains. The central premise (LLMs as controllers) is presented as a philosophy motivating the system, not derived from prior self-citations or ansatzes. This is a standard engineering/systems paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces no new mathematical constants or fitted parameters. It rests on the domain assumption that current LLMs possess sufficient planning and reasoning ability to select and coordinate external models correctly.

axioms (1)

domain assumption Large language models such as ChatGPT exhibit strong capabilities in language understanding, generation, interaction, and reasoning sufficient for task planning and model selection.
This assumption is invoked in the abstract to justify using ChatGPT as the controller for the entire pipeline.

pith-pipeline@v0.9.0 · 5551 in / 1333 out tokens · 54010 ms · 2026-05-14T00:01:14.249105+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we use ChatGPT to conduct task planning when receiving a user request, select models according to their function descriptions available in Hugging Face, execute each subtask with the selected AI model, and summarize the response according to the execution results
IndisputableMonolith.Foundation.DimensionForcing dimension_forced unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

HuggingGPT can tackle a wide range of sophisticated AI tasks spanning different modalities and domains
IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

By leveraging the strong language capability of ChatGPT and abundant AI models in Hugging Face

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems
cs.SE 2026-05 unverdicted novelty 7.0

SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...
From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework
cs.LG 2026-05 unverdicted novelty 7.0

AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
cs.CR 2026-04 unverdicted novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling
cs.AI 2026-04 unverdicted novelty 7.0

IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
GAIA: a benchmark for General AI Assistants
cs.CL 2023-11 unverdicted novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
VideoChat: Chat-Centric Video Understanding
cs.CV 2023-05 conditional novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
Reflexion: Language Agents with Verbal Reinforcement Learning
cs.AI 2023-03 conditional novelty 7.0

Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.
Why Does Agentic Safety Fail to Generalize Across Tasks?
cs.LG 2026-05 conditional novelty 6.0

Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
cs.CR 2026-02 unverdicted novelty 6.0

The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
cs.CL 2024-10 unverdicted novelty 6.0

OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
cs.CV 2024-01 unverdicted novelty 6.0

Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
cs.CV 2023-11 unverdicted novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
eess.AS 2023-11 unverdicted novelty 6.0

Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
cs.CV 2023-06 unverdicted novelty 6.0

MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
Gorilla: Large Language Model Connected with Massive APIs
cs.CL 2023-05 conditional novelty 6.0

Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution
cs.SE 2026-04 conditional novelty 5.0

Bounded autonomy using typed action contracts and consumer-side execution lets LLMs safely operate enterprise systems, achieving 23 of 25 tasks with zero unsafe executions versus 17 for unconstrained AI across 25 trials.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
cs.CV 2023-12 unverdicted novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
cs.CV 2023-04 conditional novelty 5.0

LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR 2026-05 unverdicted novelty 4.0

The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
A Self-Healing Framework for Reliable LLM-Based Autonomous Agents
cs.SE 2026-05 unverdicted novelty 4.0

A framework that monitors LLM agent behavior, assesses reliability, and automatically heals failures to raise task success rates in multi-agent workflows.
Understanding the planning of LLM agents: A survey
cs.AI 2024-02 accept novelty 4.0

A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
The Rise and Potential of Large Language Model Based Agents: A Survey
cs.AI 2023-09 accept novelty 4.0

The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 25 Pith papers · 9 internal anchors

[1]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

work page 2020
[2]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedba...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, and others. Palm: Scaling language modeling with pathways. ArXiv, abs/2204.02311, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open Pre-trained Transformer Language Models. ArXiv, abs/2205.01068, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

Glm-130b: An Open Bilingual Pre-trained Model

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. Glm-130b: An Open Bilingual Pre-trained Model. ICLR 2023 poster, 2023

work page 2023
[6]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aur’elien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and Efficient Foundation Language Models. ArXiv, abs/2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

An Explanation of In-context Learning as Implicit Bayesian Inference

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An Explanation of In-context Learning as Implicit Bayesian Inference. ICLR 2022 Poster, 2022

work page 2022
[8]

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2022

work page 2022
[9]

Dai, and Quoc V Le

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022

work page 2022
[10]

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan 11 Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Virendrabhai Purohit, Ishani Mondal, Jacob William Anderson, Kirby C. Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel...

work page 2022
[11]

S. Iyer, Xiaojuan Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, Xian Li, Brian O’Horo, Gabriel Pereyra, Jeff Wang, Christopher Dewan, Asli Celikyilmaz, Luke Zettlemoyer, and Veselin Stoyanov. Opt- IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalizati...

work page arXiv 2022
[12]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y . Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jef...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

Smith, Daniel Khashabi, and Hannaneh Hajishirzi

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instruc- tions, 2022

work page 2022
[14]

W., Tay, Y ., Zhou, D., Le, Q

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V . Le, Barret Zoph, Jason Wei, and Adam Roberts. The flan collection: Designing data and methods for effective instruction tuning. CoRR, abs/2301.13688, 2023

work page arXiv 2023
[15]

Chain of Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain of Thought Prompting Elicits Reasoning in Large Language Models. In Conference on Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[16]

Large Language Models are Zero-Shot Reasoners

Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large Language Models are Zero-Shot Reasoners. In Conference on Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[17]

PAL: Program-aided Language Models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided Language Models. ArXiv, abs/2211.10435, 2022

work page Pith review arXiv 2022
[18]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023 poster, abs/2203.11171, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. CoRR, abs/2206.07682, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[20]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023

work page 2023
[21]

Flamingo: a visual language model for few-shot learning, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Ruther- ford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Miko- laj Binkow...

work page 2022
[22]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, S. Savarese, and Steven Hoi. Blip-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ArXiv, abs/2301.12597, 2023. 12

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Language is not all you need: Aligning perception with language models

Shaohan Huang, Li Dong, Wenhui Wang, Y . Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, O. Mohammed, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and Furu Wei. Language Is Not All You Need: Aligning Perception with Language Models. ArXiv, abs/2302.14045, 2023

work page arXiv 2023
[24]

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, M. Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language Models Can Teach Themselves to Use Tools. ArXiv, abs/2302.04761, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Vipergpt: Visual inference via python execution for reasoning, 2023

Dídac Surís, Sachit Menon, and Carl V ondrick. Vipergpt: Visual inference via python execution for reasoning, 2023

work page 2023
[26]

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Chenfei Wu, Sheng-Kai Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv, 2023

work page 2023
[27]

Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis, 2023

Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, Yun Wang, Linjun Shou, Ming Gong, and Nan Duan. Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis, 2023

work page 2023
[28]

Tool learning with foundation models, 2023

Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi, Yuzhang Zhu, Zhenning Dai, Lan Yan, Xin Cong, Yaxi Lu, Weilin Zhao, Yux...

work page 2023
[29]

Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. Blip: Bootstrapping Language- Image Pre-training for Unified Vision-Language Understanding and Generation. InInternational Conference on Machine Learning (ICML), pages 12888–12900, 2022

work page 2022
[30]

Adding conditional control to text-to-image diffusion models,

Lvmin Zhang and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. ArXiv, abs/2302.05543, 2023

work page arXiv 2023
[31]

Visual pro- gramming: Compositional visual reasoning without training

Tanmay Gupta and Aniruddha Kembhavi. Visual Programming: Compositional visual reasoning without training. arXiv, abs/2211.11559, 2022

work page arXiv 2022
[32]

An image is worth 16x16 words: Transformers for image recognition at scale, 2021

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021

work page 2021
[33]

The illustrated image captioning using transformers

Ankur Kumar. The illustrated image captioning using transformers. ankur3107.github.io, 2022

work page 2022
[34]

End-to-end object detection with transformers, 2020

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers, 2020

work page 2020
[35]

Marzal and E

A. Marzal and E. Vidal. Computation of normalized edit distance and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):926–932, 1993

work page 1993
[36]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023

work page 2023
[37]

task": task,

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. 13 A Appendix A.1 More details In this section, we will present more details about some designs of each stage in HuggingGP...

work page 2023
[38]

generated text

{"generated text": "a herd of giraffes and zebras grazing in a field"}

work page
[39]

generated image

{"generated image": "images/df8t.jpg"} Query #2: then based on the above canny image and a prompt "a photo of a zoo", generate a new image. Response #2: I have received your request to generate a canny image of a photo of a zoo. I have selected the model a lllyasviel/sd-controlnet-canny as it is the best model for this task. I have processed the image and...

work page
[40]

generated image

{"generated image": "images/ja2f.jpg"} Task Planning #2: 1: {"task": "canny-text-to-image", "id": 0, dep: [-1], "args": {"text":"a photo of a zoo", "image": "example.jpg"}} Execution Results #3:

work page
[41]

22 Query: Describe this image example.jpg in as much detail as possible

{'generated image with predicted box': '/images/dif2.jpg', 'predicted': [{'box': {'xmax': 760, 'xmin': 515, 'ymax': 437, 'ymin': 136}, 'label': 'giraffe', 'score': 0.9992127418518066}, {'box': {'xmax': 561, 'xmin': 510, 'ymax': 447, 'ymin': 357}, 'label': 'zebra', 'score': 0.9983904361724854}, {'box': {'xmax': 324, 'xmin': 178, 'ymax': 446, 'ymin': 336}, ...

work page
[42]

id": "0",

{"id": "0", "dep": [-1], "task": "image-to-text", "args": {"image": "example.jpg" }},

work page
[43]

id": "1",

{"id": "1", "dep": [-1], "task": "image-classification", "args": {"image": "example.jpg" }},

work page
[44]

id": "2",

{"id": "2", "dep": [-1], "task": "object-detection", "args": {"image": "example.jpg" }},

work page
[45]

id": "3",

{"id": "3", "dep": [-1], "task": "image- segmentation", "args": {"image": "example.jpg" }},

work page
[46]

id": "4",

{"id": "4", "dep": [-1], "task": "visual-question- answering", "args": {"text": "describe this image in detail", "image": "example.jpg" }}, Response: The image you provided appears to show a family of four dogs playing in the grass. For this task, I ran multiple CV tasks, including image-to- text, image-classification, object-detection, visual- question-a...

work page
[47]

generated_text

{"generated_text": "a family of four dogs are playing in the grass "}

work page
[48]

[{'score': 0.938717246055603, 'label': 'Rhodesian ridgeback'}, {'score': 0.01652258075773716, 'label': 'boxer'}, {'score': 0.006381669547408819, 'label': 'Great Dane'}, {'score': 0.006234415341168642, 'label': 'vizsla, Hungarian pointer'}, {'score': 0.005944834090769291, 'label': 'bull mastiff'}]

work page
[49]

generated image with predicted box

[{"generated image with predicted box": "/images/ sd48.jpg","predicted": [{"box": {"xmax": 463, "xmin": 373, "ymax": 267, "ymin": 199}, "label": "bus", "score": 0.9981155395507812}, {"box": {"xmax": 292, "xmin": 124, "ymax": 224, "ymin": 175}, "label": "airplane", "score": 0.9983609318733215}, {"box": {"xmax": 554, "xmin": 487, "ymax": 277, "ymin": 226}, ...

work page
[50]

generated image with segmentation

{"generated image with segmentation": "/images/ t4s1.jpg", "predicted": [{"score": 0.989, "label": "grass"}, {"score": 0.999, "label": "dog"}, {"score": 0.999, "label": "tree"},{"score": 0.999, "label": "dog"}]

work page
[51]

task": "object-detection

[{'answer': 'dogs', 'score': 0.8488452434539795}, {'answer': 'dog', 'score': 0.04168461635708809}] Figure 7: Case study on complex tasks (b). 23 Query: Can you tell me how many objects in the picture example1.jpg? Task Planning: 1: {"task": "object-detection", "id": 0, dep: [-1], "args": {"image": "example1.jpg"}}, 2: {"task": "image-to-text", "id": 1, de...

work page
[52]

generated image with predicted box

[{"generated image with predicted box": "/images/79f2.jpg", "predicted": [{"box": {"xmax": 463, "xmin": 373, "ymax": 267, "ymin": 199}, "label": "bus", "score": 0.9981155395507812}, {"box": {"xmax": 292, "xmin": 124, "ymax": 224, "ymin": 175}, "label": "airplane", "score": 0.9983609318733215}, {"box": {"xmax": 554, "xmin": 487, "ymax": 277, "ymin": 226}, ...

work page
[53]

generated_text

{"generated_text": "a large jetliner sitting on top of an airport tarmac "} Query: In example2.jpg, what's the animal and what's it doing? Task Planning: 1: {"task": "image-classification", "id": 0, dep: [-1], "args": {"image": "example2.jpg" }}, 2: {"task": "visual-question-answering", "id": 1, dep: [-1], "args": {"text": "What is the animal doing?", "im...

work page
[54]

[{'score': 0.8858247399330139, 'label': 'tiger, Panthera tigris'}, {'score': 0.10940514504909515, 'label': 'tiger cat'}, {'score': 0.0006216467591002584, 'label': 'jaguar, panther, Panthera onca, Felis onca'}, {'score': 0.0004262701258994639, 'label': 'dhole, Cuon alpinus'}, {'score': 0.00030842673731967807, 'label': 'lion, king of beasts, Panthera leo'}]

work page
[55]

[{'answer': 'sitting', 'score': 0.8294351696968079}, {'answer': 'laying down', 'score': 0.7387925386428833}, {'answer': 'resting', 'score': 0.625247061252594}, {'answer': 'laying', 'score': 0.6114301681518555}, {'answer': 'relaxing', 'score': 0.44740357995033264}]

work page
[56]

24 Query: based on the pose image of example1.jpg and the description of example2.jpg, please generate a new one

{'generated_text': 'a brown and white striped zebra laying on a rock'} Figure 8: Case study on complex tasks (c). 24 Query: based on the pose image of example1.jpg and the description of example2.jpg, please generate a new one. Response: Based on the pose image of example1.jpg and the description of example2.jpg, I have generated a new image s5f8.jpg for ...

work page
[57]

{'task': 'image-to-text', 'id': 1, 'dep': [-1], 'args': {'image': 'example2.jpg'}}

work page
[58]

task": "pose-text-to-image

{"task": "pose-text-to-image", "id": 2, "dep": [0, 1], "args": {"text":"<resource>-1", "image": "<resource>-0"}} <resource>-0 <resource>-1 sd-controlnet- openpose Figure 9: Qualitative analysis of multi-model cooperation with resource dependency. 25 Query: generate a video with the title "An astronaut is walking in space" and dub it. Task Planning: 1: {"t...

work page
[59]

generated text

{"generated text": "INVOICE NO : 18301/102/T0305"} Query: read the image example.jpg for me. Response: Sure. I have read the image example.jpg for you. The inference result is 'INVOICE NO : 18301/102/T0305'. The model used for this task is microsoft/trocr-base-printed, which is best suited for the task of image-to-text as it is fine-tuned on the SROIE dat...

work page
[60]

generated audio

{"generated audio": "/audios/3f9e.flac"} Execution Results:

work page
[61]

generated video

{"generated video": "/videos/5696.mp4"}

work page
[62]

generated audio

{"generated audio": "/audios/da5s.wav"} Figure 10: Qualitative analysis of multi-model cooperation on video and audio modalities. 26 Query: Given a collection of images A: /examples/a.jpg, B: /examples/b.jpg, C: /examples/c.jpg, please tell me how many zebras in these pictures? Response: In the collection of images A, B, and C, there are a total of 4 zebr...

work page