HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-14 00:01 UTC · model grok-4.3
The pith
Large language models like ChatGPT can coordinate existing AI models to solve sophisticated multi-modal tasks, planning the subtasks and selecting models via their language descriptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Based on the philosophy that language can serve as a generic interface, HuggingGPT leverages LLMs to connect various AI models to solve complicated AI tasks. Specifically, ChatGPT conducts task planning when receiving a user request, selects models according to their function descriptions available in Hugging Face, executes each subtask with the selected AI model, and summarizes the response according to the execution results. By leveraging the strong language capability of ChatGPT and abundant AI models in Hugging Face, HuggingGPT can tackle a wide range of sophisticated AI tasks spanning different modalities and domains.
What carries the argument
The HuggingGPT agent, an LLM-powered system that performs task planning, model selection from function descriptions, subtask execution, and response summarization to connect specialized models.
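To make the pipeline concrete, here is a minimal sketch of the four-stage loop in Python. The `chat` and `run_model` helpers are hypothetical stand-ins for an LLM endpoint and Hugging Face inference, and the prompts are illustrative; only the four-field subtask format (task, id, dep, args) and the `<resource>-<id>` dependency convention are taken from the paper's appendix examples.

```python
import json

def chat(prompt: str) -> str:
    """Hypothetical LLM call (e.g. a ChatGPT endpoint); replace with a real client."""
    raise NotImplementedError

def run_model(model_id: str, args: dict) -> str:
    """Hypothetical Hugging Face inference call for the selected model."""
    raise NotImplementedError

def solve(user_request: str, model_descriptions: dict) -> str:
    # Stage 1: task planning. The LLM decomposes the request into subtasks
    # in the paper's four-field format: task, id, dep, args.
    plan = json.loads(chat(
        'Decompose the request into a JSON list of subtasks, each with '
        '"task", "id", "dep" (ids of prerequisite subtasks, -1 if none) '
        'and "args". Request: ' + user_request))

    results: dict[str, str] = {}
    # Simplification: sort by id; a real planner would topologically
    # sort subtasks by their "dep" edges.
    for subtask in sorted(plan, key=lambda t: t["id"]):
        # Resolve resource dependencies: "<resource>-<id>" placeholders in
        # args refer to the outputs of earlier subtasks.
        args = {k: results.get(v, v) if isinstance(v, str) else v
                for k, v in subtask["args"].items()}

        # Stage 2: model selection from textual function descriptions.
        candidates = model_descriptions[subtask["task"]]
        model_id = chat(
            f"Pick the best model for task {subtask['task']!r} with args "
            f"{args!r}. Candidates: {candidates}. Reply with one model id.")

        # Stage 3: subtask execution with the selected model.
        results[f"<resource>-{subtask['id']}"] = run_model(model_id, args)

    # Stage 4: response summarization over the execution results.
    return chat("Summarize these results for the user: " + json.dumps(results))
```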
If this is right
- AI tasks spanning multiple modalities can be solved by coordinating existing models rather than creating new integrated systems.
- Impressive performance is achieved on language, vision, speech, and other challenging tasks.
- This method opens a new direction toward artificial general intelligence by using LLMs as managers of specialized tools.
- Users can tackle sophisticated problems through natural language requests without needing expertise in each domain.
Where Pith is reading between the lines
- This orchestration method could extend to other model repositories, creating a more interconnected AI ecosystem.
- Potential limitations in planning accuracy might be addressed by allowing the system to test multiple model options in parallel.
- Similar controllers could be built for other ecosystems of AI tools, broadening the approach beyond one platform.
Load-bearing premise
ChatGPT can reliably perform task planning and select appropriate models from their function descriptions without frequent errors that break the overall solution.
What would settle it
Running the system on benchmark tasks with known correct model sequences and measuring the frequency of incorrect model selections or failed task completions.
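A minimal sketch of that measurement, assuming a hypothetical `plan_and_select` hook that exposes the agent's ordered (task, model) choices for a request:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkCase:
    request: str                          # natural-language user request
    gold_sequence: list[tuple[str, str]]  # known-correct (task, model_id) pairs, in order

def selection_error_rate(cases: list[BenchmarkCase], plan_and_select) -> float:
    """Fraction of benchmark requests where the agent's planned
    (task, model) sequence deviates from the known-correct one."""
    failures = sum(plan_and_select(c.request) != c.gold_sequence for c in cases)
    return failures / len(cases)
```

Exact match against a gold sequence is strict; a fuller harness might also credit partial matches, but even this coarse rate would bound how often planning or selection errors break the overall solution.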
Original abstract
Solving complicated AI tasks with different domains and modalities is a key step toward artificial general intelligence. While there are numerous AI models available for various domains and modalities, they cannot handle complicated AI tasks autonomously. Considering large language models (LLMs) have exhibited exceptional abilities in language understanding, generation, interaction, and reasoning, we advocate that LLMs could act as a controller to manage existing AI models to solve complicated AI tasks, with language serving as a generic interface to empower this. Based on this philosophy, we present HuggingGPT, an LLM-powered agent that leverages LLMs (e.g., ChatGPT) to connect various AI models in machine learning communities (e.g., Hugging Face) to solve AI tasks. Specifically, we use ChatGPT to conduct task planning when receiving a user request, select models according to their function descriptions available in Hugging Face, execute each subtask with the selected AI model, and summarize the response according to the execution results. By leveraging the strong language capability of ChatGPT and abundant AI models in Hugging Face, HuggingGPT can tackle a wide range of sophisticated AI tasks spanning different modalities and domains and achieve impressive results in language, vision, speech, and other challenging tasks, which paves a new way towards the realization of artificial general intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents HuggingGPT, an LLM-powered agent that uses ChatGPT to conduct task planning on user requests, select models from Hugging Face according to their textual function descriptions, execute each subtask with the chosen model, and summarize the final response. It claims this architecture enables solving a wide range of sophisticated multi-modal AI tasks across language, vision, speech, and other domains, achieving impressive results and advancing toward AGI.
Significance. If the performance claims are substantiated, the work is significant for demonstrating a practical orchestration framework that treats language as a generic interface between LLMs and the existing ecosystem of specialized models on Hugging Face. This approach could reduce the need for end-to-end retraining and enable rapid composition of capabilities, with potential for reproducible agent systems if code and prompts are released.
major comments (2)
- [Results] Results section: the central claim of 'impressive results' across modalities rests on qualitative examples only; no quantitative metrics, baselines, error rates, or benchmark suites are reported, leaving the breadth and reliability of the system unverified.
- [Section 3] Task planning and model selection (Section 3): the assumption that ChatGPT reliably decomposes requests and maps them to correct HF models via descriptions is load-bearing for end-to-end success, yet no accuracy measurements, failure-mode analysis, or controlled tests of planning errors are provided.
minor comments (2)
- [Abstract] The abstract and introduction repeat the phrase 'impressive results' without defining success criteria or providing even summary statistics.
- [Figure 1] Figure 1 (system overview): the four-stage pipeline diagram would be clearer with explicit arrows and labels for the planning-selection-execution-summarization loop.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will incorporate revisions to strengthen the evaluation and analysis sections.
Point-by-point responses
- Referee: [Results] Results section: the central claim of 'impressive results' across modalities rests on qualitative examples only; no quantitative metrics, baselines, error rates, or benchmark suites are reported, leaving the breadth and reliability of the system unverified.
  Authors: We agree that the current results section relies on qualitative demonstrations. The manuscript prioritizes showcasing the framework's ability to handle diverse multi-modal tasks via examples rather than exhaustive benchmarking, as standardized benchmarks for open-ended agent systems remain limited. In revision, we will add quantitative evaluation including success rates on a curated set of 100 user queries spanning modalities, failure case categorization, and discussion of why full baselines are challenging for this architecture. (revision: yes)
- Referee: [Section 3] Task planning and model selection (Section 3): the assumption that ChatGPT reliably decomposes requests and maps them to correct HF models via descriptions is load-bearing for end-to-end success, yet no accuracy measurements, failure-mode analysis, or controlled tests of planning errors are provided.
  Authors: We acknowledge the absence of isolated metrics for the planning and selection stages. The paper presents the end-to-end system with illustrative cases but does not isolate component accuracy. We will revise Section 3 to include controlled experiments: accuracy of task decomposition and model selection measured on 200 held-out queries, plus a failure-mode analysis (e.g., errors from ambiguous descriptions or over-decomposition). This will be added without altering the core claims. (revision: yes)
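As a sketch of the controlled experiment promised above, assuming hypothetical `agent.plan` / `agent.select` hooks and gold-annotated held-out queries:

```python
from collections import Counter

def stage_accuracy(queries, agent):
    """Score planning and selection separately on held-out queries.
    `agent.plan` and `agent.select` are hypothetical hooks exposing the
    two stages; each query carries gold annotations and a categorizer."""
    plan_hits = select_hits = 0
    failure_modes = Counter()
    for q in queries:
        plan = agent.plan(q.text)
        if plan == q.gold_plan:
            plan_hits += 1
        else:
            # e.g. "over-decomposition", "missing subtask", "wrong order"
            failure_modes[q.categorize(plan)] += 1
        # Feed the gold plan to selection so its accuracy is measured
        # independently of upstream planning errors.
        select_hits += agent.select(q.gold_plan) == q.gold_models
    n = len(queries)
    return plan_hits / n, select_hits / n, failure_modes
```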
Circularity Check
No circularity: system architecture description with no mathematical derivation or fitted predictions
full rationale
The paper is a system description of HuggingGPT, an LLM-powered agent that uses ChatGPT for task planning, model selection from Hugging Face descriptions, execution, and summarization. No equations, parameters, or derivations are present that could reduce by construction to inputs. Claims rest on the described pipeline and qualitative examples rather than any self-referential fits, uniqueness theorems, or self-citation chains. The central premise (LLMs as controllers) is presented as a philosophy motivating the system, not derived from prior self-citations or ansatzes. This is a standard engineering/systems paper with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Large language models such as ChatGPT exhibit strong capabilities in language understanding, generation, interaction, and reasoning, sufficient for task planning and model selection.
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "we use ChatGPT to conduct task planning when receiving a user request, select models according to their function descriptions available in Hugging Face, execute each subtask with the selected AI model, and summarize the response according to the execution results"
- IndisputableMonolith.Foundation.DimensionForcing.dimension_forced (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "HuggingGPT can tackle a wide range of sophisticated AI tasks spanning different modalities and domains"
- IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "By leveraging the strong language capability of ChatGPT and abundant AI models in Hugging Face"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 25 Pith papers
- SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems
  SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...
- From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework
  AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
  A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
- AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
  AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
- IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling
  IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
  Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
- GAIA: a benchmark for General AI Assistants
  GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
- VideoChat: Chat-Centric Video Understanding
  VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
- Reflexion: Language Agents with Verbal Reinforcement Learning
  Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.
- Why Does Agentic Safety Fail to Generalize Across Tasks?
  Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...
- SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
  The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
- Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
  Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
  Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
- Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
  Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
  MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
- Gorilla: Large Language Model Connected with Massive APIs
  Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
- Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution
  Bounded autonomy using typed action contracts and consumer-side execution lets LLMs safely operate enterprise systems, achieving 23 of 25 tasks with zero unsafe executions versus 17 for unconstrained AI across 25 trials.
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
- LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
  LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.
- A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
  The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
- A Self-Healing Framework for Reliable LLM-Based Autonomous Agents
  A framework that monitors LLM agent behavior, assesses reliability, and automatically heals failures to raise task success rates in multi-agent workflows.
- Understanding the planning of LLM agents: A survey
  A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
- The Rise and Potential of Large Language Model Based Agents: A Survey
  The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
- Large Language Models: A Survey
  The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
Reference graph
Works this paper leans on
- [1] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi... (2020)
- [2] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedba... (2022)
- [3] PaLM: Scaling Language Modeling with Pathways
  Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, and others. Palm: Scaling language modeling with pathways. ArXiv, abs/2204.02311, 2022.
- [4] OPT: Open Pre-trained Transformer Language Models
  Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open Pre-trained Transformer Language Models. ArXiv, abs/2205.01068, 2022.
- [5] Glm-130b: An Open Bilingual Pre-trained Model
  Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. Glm-130b: An Open Bilingual Pre-trained Model. ICLR 2023 poster, 2023.
- [6] LLaMA: Open and Efficient Foundation Language Models
  Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and Efficient Foundation Language Models. ArXiv, abs/2302.13971, 2023.
- [7] An Explanation of In-context Learning as Implicit Bayesian Inference
  Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An Explanation of In-context Learning as Implicit Bayesian Inference. ICLR 2022 poster, 2022.
- [8] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2022.
- [9] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022.
- [10] Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Virendrabhai Purohit, Ishani Mondal, Jacob William Anderson, Kirby C. Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel... (2022)
- [11] S. Iyer, Xiaojuan Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, Xian Li, Brian O'Horo, Gabriel Pereyra, Jeff Wang, Christopher Dewan, Asli Celikyilmaz, Luke Zettlemoyer, and Veselin Stoyanov. Opt-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalizati...
- [12] Scaling Instruction-Finetuned Language Models
  Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jef... (2022)
- [13] Self-instruct: Aligning language model with self generated instructions
  Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions, 2022.
- [14] Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. The flan collection: Designing data and methods for effective instruction tuning. CoRR, abs/2301.13688, 2023.
- [15] Chain of Thought Prompting Elicits Reasoning in Large Language Models
  Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain of Thought Prompting Elicits Reasoning in Large Language Models. In Conference on Neural Information Processing Systems (NeurIPS), 2022.
- [16] Large Language Models are Zero-Shot Reasoners
  Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large Language Models are Zero-Shot Reasoners. In Conference on Neural Information Processing Systems (NeurIPS), 2022.
- [17] PAL: Program-aided Language Models
  Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided Language Models. ArXiv, abs/2211.10435, 2022.
- [18] Self-Consistency Improves Chain of Thought Reasoning in Language Models
  Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023 poster, abs/2203.11171, 2023.
- [19] Emergent Abilities of Large Language Models
  Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. CoRR, abs/2206.07682, 2022.
- [20]
- [21] Flamingo: a visual language model for few-shot learning
  Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkow... (2022)
- [22] Junnan Li, Dongxu Li, S. Savarese, and Steven Hoi. Blip-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ArXiv, abs/2301.12597, 2023.
- [23] Language is not all you need: Aligning perception with language models
  Shaohan Huang, Li Dong, Wenhui Wang, Y. Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, O. Mohammed, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and Furu Wei. Language Is Not All You Need: Aligning Perception with Language Models. ArXiv, abs/2302.14045, 2023.
- [24] Toolformer: Language Models Can Teach Themselves to Use Tools
  Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, M. Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language Models Can Teach Themselves to Use Tools. ArXiv, abs/2302.04761, 2023.
- [25] Vipergpt: Visual inference via python execution for reasoning
  Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning, 2023.
- [26] Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
  Chenfei Wu, Sheng-Kai Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv, 2023.
- [27] Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis
  Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, Yun Wang, Linjun Shou, Ming Gong, and Nan Duan. Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis, 2023.
- [28] Tool learning with foundation models
  Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi, Yuzhang Zhu, Zhenning Dai, Lan Yan, Xin Cong, Yaxi Lu, Weilin Zhao, Yux... (2023)
- [29] Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. Blip: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning (ICML), pages 12888–12900, 2022.
- [30] Adding conditional control to text-to-image diffusion models
  Lvmin Zhang and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. ArXiv, abs/2302.05543, 2023.
- [31] Visual programming: Compositional visual reasoning without training
  Tanmay Gupta and Aniruddha Kembhavi. Visual Programming: Compositional visual reasoning without training. arXiv, abs/2211.11559, 2022.
- [32] An image is worth 16x16 words: Transformers for image recognition at scale
  Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
- [33] The illustrated image captioning using transformers
  Ankur Kumar. The illustrated image captioning using transformers. ankur3107.github.io, 2022.
- [34] End-to-end object detection with transformers
  Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers, 2020.
- [35] A. Marzal and E. Vidal. Computation of normalized edit distance and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):926–932, 1993.
- [36] Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality
  Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
- [37] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.