{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:HR6EXQBOEZBW37EITU7YOOTWPH","short_pith_number":"pith:HR6EXQBO","schema_version":"1.0","canonical_sha256":"3c7c4bc02e26436dfc889d3f873a7679f57fa512b1b9f839c9e9d5d11ce7a4e0","source":{"kind":"arxiv","id":"2306.02707","version":1},"attestation_state":"computed","paper":{"title":"Orca: Progressive Learning from Complex Explanation Traces of GPT-4","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A 13B model trained on GPT-4's step-by-step explanations reaches ChatGPT parity on complex reasoning benchmarks.","cross_cats":["cs.LG"],"primary_cat":"cs.CL","authors_text":"Ahmed Awadallah, Arindam Mitra, Ganesh Jawahar, Hamid Palangi, Sahaj Agarwal, Subhabrata Mukherjee","submitted_at":"2023-06-05T08:58:39Z","abstract_excerpt":"Recent research has focused on enhancing the capability of smaller models through imitation learning, drawing on the outputs generated by large foundation models (LFMs). A number of issues impact the quality of these models, ranging from limited imitation signals from shallow LFM outputs; small scale homogeneous training data; and most notably a lack of rigorous evaluation resulting in overestimating the small model's capability as they tend to learn to imitate the style, but not the reasoning process of LFMs. To address these challenges, we develop Orca (We are working with our legal team to "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2306.02707","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2023-06-05T08:58:39Z","cross_cats_sorted":["cs.LG"],"title_canon_sha256":"4daa66731f0c8281c32fb5ae122014e12d616b75e13008bace6110b67d198d9a","abstract_canon_sha256":"f99376ddf09c239247038747abe31c74b056ace90aab7d796babbd73df4813f5"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:52.866752Z","signature_b64":"1MjFhSO0DC5fiwg6rIP9kmoZnI4KyaGKkhPA6rvLV8Ffw8xgMx6DIswuKrejvKauOD1w9fEsR/PwYZIs9J0hDQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"3c7c4bc02e26436dfc889d3f873a7679f57fa512b1b9f839c9e9d5d11ce7a4e0","last_reissued_at":"2026-05-17T23:38:52.866062Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:52.866062Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Orca: Progressive Learning from Complex Explanation Traces of GPT-4","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A 13B model trained on GPT-4's step-by-step explanations reaches ChatGPT parity on complex reasoning benchmarks.","cross_cats":["cs.LG"],"primary_cat":"cs.CL","authors_text":"Ahmed Awadallah, Arindam Mitra, Ganesh Jawahar, Hamid Palangi, Sahaj Agarwal, Subhabrata Mukherjee","submitted_at":"2023-06-05T08:58:39Z","abstract_excerpt":"Recent research has focused on enhancing the capability of smaller models through imitation learning, drawing on the outputs generated by large foundation models (LFMs). A number of issues impact the quality of these models, ranging from limited imitation signals from shallow LFM outputs; small scale homogeneous training data; and most notably a lack of rigorous evaluation resulting in overestimating the small model's capability as they tend to learn to imitate the style, but not the reasoning process of LFMs. To address these challenges, we develop Orca (We are working with our legal team to "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Orca reaches parity with ChatGPT on the BBH benchmark and shows competitive performance (4 pts gap with optimized system message) in professional and academic examinations like the SAT, LSAT, GRE, and GMAT, both in zero-shot settings without CoT; while trailing behind GPT-4.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that the imitation data's explanation traces cause genuine transfer of reasoning processes rather than style or pattern matching, and that benchmark gains reflect true capability improvements rather than data contamination or evaluation artifacts.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A 13B model called Orca learns detailed reasoning from GPT-4 explanation traces and reaches parity with ChatGPT on Big-Bench Hard while outperforming other 13B models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A 13B model trained on GPT-4's step-by-step explanations reaches ChatGPT parity on complex reasoning benchmarks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"9f7840789e65d0a7165417f35cf141b5fcd234c0a156ffa0c0b820a7e2e4561c"},"source":{"id":"2306.02707","kind":"arxiv","version":1},"verdict":{"id":"55f8d8da-99fb-4119-8fff-f36be47063c8","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T09:39:05.871763Z","strongest_claim":"Orca reaches parity with ChatGPT on the BBH benchmark and shows competitive performance (4 pts gap with optimized system message) in professional and academic examinations like the SAT, LSAT, GRE, and GMAT, both in zero-shot settings without CoT; while trailing behind GPT-4.","one_line_summary":"A 13B model called Orca learns detailed reasoning from GPT-4 explanation traces and reaches parity with ChatGPT on Big-Bench Hard while outperforming other 13B models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that the imitation data's explanation traces cause genuine transfer of reasoning processes rather than style or pattern matching, and that benchmark gains reflect true capability improvements rather than data contamination or evaluation artifacts.","pith_extraction_headline":"A 13B model trained on GPT-4's step-by-step explanations reaches ChatGPT parity on complex reasoning benchmarks."},"references":{"count":37,"sample":[{"doi":"","year":2023,"title":"Agieval: A human-centric benchmark for evaluating foundation models, 2023","work_id":"9bacb1bc-b7a5-49e6-a109-3192ad8f348c","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Mich","work_id":"50a8a876-91bb-4d5c-9f64-a06c4bda5da6","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2022","work_id":"02935801-5aaa-40bd-b51c-9e8e68732183","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Training language models to follow instructions with human feedback","work_id":"52aff42f-4fa9-4fcf-bdb3-1459b9bebf65","ref_index":5,"cited_arxiv_id":"2203.02155","is_internal_anchor":true},{"doi":"","year":2022,"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","ref_index":6,"cited_arxiv_id":"2212.08073","is_internal_anchor":true}],"resolved_work":37,"snapshot_sha256":"e085ba0c08d3e0db86849b27432add2fed1ee4d01f75fee1fc557a250f76ec81","internal_anchors":5},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2306.02707","created_at":"2026-05-17T23:38:52.866176+00:00"},{"alias_kind":"arxiv_version","alias_value":"2306.02707v1","created_at":"2026-05-17T23:38:52.866176+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2306.02707","created_at":"2026-05-17T23:38:52.866176+00:00"},{"alias_kind":"pith_short_12","alias_value":"HR6EXQBOEZBW","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"HR6EXQBOEZBW37EI","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"HR6EXQBO","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":34,"internal_anchor_count":34,"sample":[{"citing_arxiv_id":"2409.18169","citing_title":"Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey","ref_index":108,"is_internal_anchor":true},{"citing_arxiv_id":"2501.05465","citing_title":"Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)","ref_index":95,"is_internal_anchor":true},{"citing_arxiv_id":"2502.09487","citing_title":"Internal narratives parameterise affective states","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06638","citing_title":"Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16379","citing_title":"An Information-Theoretic Criterion for Efficient Data Synthesis","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2508.15202","citing_title":"Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2509.22075","citing_title":"CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2408.00724","citing_title":"Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models","ref_index":274,"is_internal_anchor":true},{"citing_arxiv_id":"2309.05653","citing_title":"MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2402.13116","citing_title":"A Survey on Knowledge Distillation of Large Language Models","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2504.21318","citing_title":"Phi-4-reasoning Technical Report","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2601.20255","citing_title":"HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2309.14525","citing_title":"Aligning Large Multimodal Models with Factually Augmented RLHF","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14071","citing_title":"Distribution Corrected Offline Data Distillation for Large Language Models","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10720","citing_title":"Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10999","citing_title":"SkillGen: Verified Inference-Time Agent Skill Synthesis","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2306.11644","citing_title":"Textbooks Are All You Need","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11629","citing_title":"OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08472","citing_title":"Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06638","citing_title":"Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08838","citing_title":"Generating Leakage-Free Benchmarks for Robust RAG Evaluation","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04078","citing_title":"Validity-Calibrated Reasoning Distillation","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24819","citing_title":"Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06638","citing_title":"Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2402.06196","citing_title":"Large Language Models: A Survey","ref_index":96,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/HR6EXQBOEZBW37EITU7YOOTWPH","json":"https://pith.science/pith/HR6EXQBOEZBW37EITU7YOOTWPH.json","graph_json":"https://pith.science/api/pith-number/HR6EXQBOEZBW37EITU7YOOTWPH/graph.json","events_json":"https://pith.science/api/pith-number/HR6EXQBOEZBW37EITU7YOOTWPH/events.json","paper":"https://pith.science/paper/HR6EXQBO"},"agent_actions":{"view_html":"https://pith.science/pith/HR6EXQBOEZBW37EITU7YOOTWPH","download_json":"https://pith.science/pith/HR6EXQBOEZBW37EITU7YOOTWPH.json","view_paper":"https://pith.science/paper/HR6EXQBO","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2306.02707&json=true","fetch_graph":"https://pith.science/api/pith-number/HR6EXQBOEZBW37EITU7YOOTWPH/graph.json","fetch_events":"https://pith.science/api/pith-number/HR6EXQBOEZBW37EITU7YOOTWPH/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/HR6EXQBOEZBW37EITU7YOOTWPH/action/timestamp_anchor","attest_storage":"https://pith.science/pith/HR6EXQBOEZBW37EITU7YOOTWPH/action/storage_attestation","attest_author":"https://pith.science/pith/HR6EXQBOEZBW37EITU7YOOTWPH/action/author_attestation","sign_citation":"https://pith.science/pith/HR6EXQBOEZBW37EITU7YOOTWPH/action/citation_signature","submit_replication":"https://pith.science/pith/HR6EXQBOEZBW37EITU7YOOTWPH/action/replication_record"}},"created_at":"2026-05-17T23:38:52.866176+00:00","updated_at":"2026-05-17T23:38:52.866176+00:00"}