{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:BCWOJTQQ3S6CLOPQTE7P5E3XV2","short_pith_number":"pith:BCWOJTQQ","schema_version":"1.0","canonical_sha256":"08ace4ce10dcbc25b9f0993efe9377aeb5a54721c17707096d3b74388b8b909e","source":{"kind":"arxiv","id":"2303.17580","version":4},"attestation_state":"computed","paper":{"title":"HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Large language models like ChatGPT can coordinate existing AI models to solve sophisticated multi-modal tasks by planning and selecting them via language descriptions.","cross_cats":["cs.AI","cs.CV","cs.LG"],"primary_cat":"cs.CL","authors_text":"Dongsheng Li, Kaitao Song, Weiming Lu, Xu Tan, Yongliang Shen, Yueting Zhuang","submitted_at":"2023-03-30T17:48:28Z","abstract_excerpt":"Solving complicated AI tasks with different domains and modalities is a key step toward artificial general intelligence. While there are numerous AI models available for various domains and modalities, they cannot handle complicated AI tasks autonomously. Considering large language models (LLMs) have exhibited exceptional abilities in language understanding, generation, interaction, and reasoning, we advocate that LLMs could act as a controller to manage existing AI models to solve complicated AI tasks, with language serving as a generic interface to empower this. Based on this philosophy, we "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2303.17580","kind":"arxiv","version":4},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2023-03-30T17:48:28Z","cross_cats_sorted":["cs.AI","cs.CV","cs.LG"],"title_canon_sha256":"87c03022be7538b6640bc8aabeffa8e484de18c7a5814473f7060c4970dc07aa","abstract_canon_sha256":"62acc73266a99899bf6553ee8bb2946c3ab45cc71a2f7722c3a1de9f806cb9c1"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-18T03:27:44.464788Z","signature_b64":"lTqH03wI/gr0094vOXgJI7ts64tsY7TV793L/QLZO6ah0E/z4iecNkOSF+YlavcBaXzRjJGaEQ4M7W8iX1HZBw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"08ace4ce10dcbc25b9f0993efe9377aeb5a54721c17707096d3b74388b8b909e","last_reissued_at":"2026-05-18T03:27:44.464076Z","signature_status":"signed_v1","first_computed_at":"2026-05-18T03:27:44.464076Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Large language models like ChatGPT can coordinate existing AI models to solve sophisticated multi-modal tasks by planning and selecting them via language descriptions.","cross_cats":["cs.AI","cs.CV","cs.LG"],"primary_cat":"cs.CL","authors_text":"Dongsheng Li, Kaitao Song, Weiming Lu, Xu Tan, Yongliang Shen, Yueting Zhuang","submitted_at":"2023-03-30T17:48:28Z","abstract_excerpt":"Solving complicated AI tasks with different domains and modalities is a key step toward artificial general intelligence. While there are numerous AI models available for various domains and modalities, they cannot handle complicated AI tasks autonomously. Considering large language models (LLMs) have exhibited exceptional abilities in language understanding, generation, interaction, and reasoning, we advocate that LLMs could act as a controller to manage existing AI models to solve complicated AI tasks, with language serving as a generic interface to empower this. Based on this philosophy, we "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"By leveraging the strong language capability of ChatGPT and abundant AI models in Hugging Face, HuggingGPT can tackle a wide range of sophisticated AI tasks spanning different modalities and domains and achieve impressive results in language, vision, speech, and other challenging tasks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That ChatGPT can reliably perform task planning and select appropriate models from their function descriptions without frequent errors that would break the overall solution.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Large language models like ChatGPT can coordinate existing AI models to solve sophisticated multi-modal tasks by planning and selecting them via language descriptions.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"ca1835c3c1181ef4f9924a49e3ed12cb696fc7bb4ab3e46f8f5c2706e0ceffc4"},"source":{"id":"2303.17580","kind":"arxiv","version":4},"verdict":{"id":"68396d14-6532-4211-b498-8cb30ea9fe0d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T00:01:14.249105Z","strongest_claim":"By leveraging the strong language capability of ChatGPT and abundant AI models in Hugging Face, HuggingGPT can tackle a wide range of sophisticated AI tasks spanning different modalities and domains and achieve impressive results in language, vision, speech, and other challenging tasks.","one_line_summary":"HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That ChatGPT can reliably perform task planning and select appropriate models from their function descriptions without frequent errors that would break the overall solution.","pith_extraction_headline":"Large language models like ChatGPT can coordinate existing AI models to solve sophisticated multi-modal tasks by planning and selecting them via language descriptions."},"references":{"count":62,"sample":[{"doi":"","year":2020,"title":"Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretch","work_id":"fc5ef596-f318-4694-b4e3-c27c7f607110","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Training language models to follow instructions with human feedback","work_id":"52aff42f-4fa9-4fcf-bdb3-1459b9bebf65","ref_index":2,"cited_arxiv_id":"2203.02155","is_internal_anchor":true},{"doi":"","year":2022,"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","ref_index":3,"cited_arxiv_id":"2204.02311","is_internal_anchor":true},{"doi":"","year":2022,"title":"OPT: Open Pre-trained Transformer Language Models","work_id":"d7ff3b21-1fff-4cf4-952a-4714e3ef2307","ref_index":4,"cited_arxiv_id":"2205.01068","is_internal_anchor":true},{"doi":"","year":2023,"title":"Glm-130b: An Open Bilingual Pre-trained Model","work_id":"b3223802-87d5-46d9-bf83-737d373c6060","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":62,"snapshot_sha256":"cc29953c37320229c695f1dd75058917d2da52f535108097c6e3f8f9d597cb1c","internal_anchors":12},"formal_canon":{"evidence_count":3,"snapshot_sha256":"99f80c9d9ecc4140de7eaf833bb4cbf6d5cf417471c5424de9b300eb0c5eadad"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2303.17580","created_at":"2026-05-18T03:27:44.464195+00:00"},{"alias_kind":"arxiv_version","alias_value":"2303.17580v4","created_at":"2026-05-18T03:27:44.464195+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2303.17580","created_at":"2026-05-18T03:27:44.464195+00:00"},{"alias_kind":"pith_short_12","alias_value":"BCWOJTQQ3S6C","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"BCWOJTQQ3S6CLOPQ","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"BCWOJTQQ","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":40,"internal_anchor_count":40,"sample":[{"citing_arxiv_id":"2304.14178","citing_title":"mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2407.08101","citing_title":"What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07358","citing_title":"A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16508","citing_title":"The Scaling Laws of Skills in LLM Agent Systems","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19099","citing_title":"DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18133","citing_title":"An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20173","citing_title":"A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2307.06435","citing_title":"A Comprehensive Overview of Large Language Models","ref_index":224,"is_internal_anchor":true},{"citing_arxiv_id":"2403.00476","citing_title":"TempCompass: Do Video LLMs Really Understand Videos?","ref_index":118,"is_internal_anchor":true},{"citing_arxiv_id":"2306.13549","citing_title":"A Survey on Multimodal Large Language Models","ref_index":183,"is_internal_anchor":true},{"citing_arxiv_id":"2309.17421","citing_title":"The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)","ref_index":114,"is_internal_anchor":true},{"citing_arxiv_id":"2306.05301","citing_title":"ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2306.06070","citing_title":"Mind2Web: Towards a Generalist Agent for the Web","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2306.06070","citing_title":"Mind2Web: Towards a Generalist Agent for the Web","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2305.17144","citing_title":"Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2304.15010","citing_title":"LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2602.20867","citing_title":"SoK: Agentic Skills -- Beyond Tool Use in LLM Agents","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2311.10122","citing_title":"Video-LLaVA: Learning United Visual Representation by Alignment Before Projection","ref_index":74,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13716","citing_title":"SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2305.06355","citing_title":"VideoChat: Chat-Centric Video Understanding","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2312.14238","citing_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","ref_index":126,"is_internal_anchor":true},{"citing_arxiv_id":"2402.02716","citing_title":"Understanding the planning of LLM agents: A survey","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2401.10774","citing_title":"Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads","ref_index":107,"is_internal_anchor":true},{"citing_arxiv_id":"2410.23218","citing_title":"OS-ATLAS: A Foundation Action Model for Generalist GUI Agents","ref_index":108,"is_internal_anchor":true},{"citing_arxiv_id":"2311.07919","citing_title":"Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models","ref_index":31,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/BCWOJTQQ3S6CLOPQTE7P5E3XV2","json":"https://pith.science/pith/BCWOJTQQ3S6CLOPQTE7P5E3XV2.json","graph_json":"https://pith.science/api/pith-number/BCWOJTQQ3S6CLOPQTE7P5E3XV2/graph.json","events_json":"https://pith.science/api/pith-number/BCWOJTQQ3S6CLOPQTE7P5E3XV2/events.json","paper":"https://pith.science/paper/BCWOJTQQ"},"agent_actions":{"view_html":"https://pith.science/pith/BCWOJTQQ3S6CLOPQTE7P5E3XV2","download_json":"https://pith.science/pith/BCWOJTQQ3S6CLOPQTE7P5E3XV2.json","view_paper":"https://pith.science/paper/BCWOJTQQ","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2303.17580&json=true","fetch_graph":"https://pith.science/api/pith-number/BCWOJTQQ3S6CLOPQTE7P5E3XV2/graph.json","fetch_events":"https://pith.science/api/pith-number/BCWOJTQQ3S6CLOPQTE7P5E3XV2/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/BCWOJTQQ3S6CLOPQTE7P5E3XV2/action/timestamp_anchor","attest_storage":"https://pith.science/pith/BCWOJTQQ3S6CLOPQTE7P5E3XV2/action/storage_attestation","attest_author":"https://pith.science/pith/BCWOJTQQ3S6CLOPQTE7P5E3XV2/action/author_attestation","sign_citation":"https://pith.science/pith/BCWOJTQQ3S6CLOPQTE7P5E3XV2/action/citation_signature","submit_replication":"https://pith.science/pith/BCWOJTQQ3S6CLOPQTE7P5E3XV2/action/replication_record"}},"created_at":"2026-05-18T03:27:44.464195+00:00","updated_at":"2026-05-18T03:27:44.464195+00:00"}