{"paper":{"title":"Instruction Tuning with GPT-4","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"GPT-4 generated instruction data enables LLaMA models to reach higher zero-shot performance on new tasks than earlier synthetic datasets.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Baolin Peng, Chunyuan Li, Jianfeng Gao, Michel Galley, Pengcheng He","submitted_at":"2023-04-06T17:58:09Z","abstract_excerpt":"Prior work has shown that finetuning large language models (LLMs) using machine-generated instruction-following data enables such models to achieve remarkable zero-shot capabilities on new tasks, and no human-written instructions are needed. In this paper, we present the first attempt to use GPT-4 to generate instruction-following data for LLM finetuning. Our early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following data generated by GPT-4 leads to superior zero-shot performance on new tasks to the instruction-following data generated by pr"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following data generated by GPT-4 leads to superior zero-shot performance on new tasks to the instruction-following data generated by previous state-of-the-art models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That performance gains on the chosen zero-shot tasks reflect genuine improvement in instruction following rather than artifacts of GPT-4's data generation process or evaluation choices.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"GPT-4 generated instruction data enables LLaMA models to reach higher zero-shot performance on new tasks than earlier synthetic datasets.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3e5864af4ec74bde1b4fb3b6168781c91bd2d762b48e55b99d23d6306483114d"},"source":{"id":"2304.03277","kind":"arxiv","version":1},"verdict":{"id":"7170c55c-42cb-42e8-b30e-0c41f8928e3d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T16:57:59.957977Z","strongest_claim":"Our early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following data generated by GPT-4 leads to superior zero-shot performance on new tasks to the instruction-following data generated by previous state-of-the-art models.","one_line_summary":"GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That performance gains on the chosen zero-shot tasks reflect genuine improvement in instruction following rather than artifacts of GPT-4's data generation process or evaluation choices.","pith_extraction_headline":"GPT-4 generated instruction data enables LLaMA models to reach higher zero-shot performance on new tasks than earlier synthetic datasets."},"references":{"count":18,"sample":[{"doi":"","year":null,"title":"A General Language Assistant as a Laboratory for Alignment","work_id":"a43f9ea0-01be-47d5-b8ee-a1a9f73381c5","ref_index":1,"cited_arxiv_id":"2112.00861","is_internal_anchor":true},{"doi":"10.5281/zenodo.7733589","year":null,"title":"Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al","work_id":"6f218053-cca5-4fde-92aa-730589931f0c","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","ref_index":3,"cited_arxiv_id":"2212.08073","is_internal_anchor":true},{"doi":"10.5281/zenodo.5297715","year":1901,"title":"GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow , March 2021","work_id":"6c7f8a44-6f52-448c-b819-5ba82a7bbc59","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Scaling Instruction-Finetuned Language Models","work_id":"8405abb1-7558-4fdf-af24-f4c52fa77a06","ref_index":5,"cited_arxiv_id":"2210.11416","is_internal_anchor":true}],"resolved_work":18,"snapshot_sha256":"086a76bdf9860cd1a85670949cbde7831e84f7f71ab1d678f23db21772a9c2ed","internal_anchors":11},"formal_canon":{"evidence_count":1,"snapshot_sha256":"c26f16a8dcc7d1eb015e86884555556a48859d117881c5f57bf53f828be0fea0"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}