{"paper":{"title":"The Falcon Series of Open Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Falcon-180B, trained on 3.5 trillion tokens from web data, nears PaLM-2-Large performance at lower pretraining and inference cost.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Abdulaziz Alshamsi, Alessandro Cappelli, Badreddine Noune, Baptiste Pannier, Daniele Mazzotta, Daniel Hesslow, Ebtesam Almazrouei, Étienne Goffinet, Guilherme Penedo, Hamza Alobeidli, Julien Launay, Mérouane Debbah, Quentin Malartic, Ruxandra Cojocaru","submitted_at":"2023-11-28T15:12:47Z","abstract_excerpt":"We introduce the Falcon series: 7B, 40B, and 180B parameters causal decoder-only models trained on a diverse high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text--the largest openly documented pretraining run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1. 
It nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it, to our knowledge, one of the three best languag"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Falcon-180B significantly outperforms models such as PaLM or Chinchilla, improves upon LLaMA 2 or Inflection-1, and nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it one of the three best language models in the world along with GPT-4 and PaLM-2-Large.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that the reported benchmark results reflect genuine capability gains rather than differences in evaluation protocols, data contamination, or undisclosed advantages in testing conditions.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Falcon-180B, trained on 3.5 trillion tokens from web data, nears PaLM-2-Large performance at lower pretraining and inference cost.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"0687c859fbc9be277aa3c4ea2d20cfaedd6de9fcafb9d9aab4e5063b6a43463a"},"source":{"id":"2311.16867","kind":"arxiv","version":2},"verdict":{"id":"265a6f6c-7ccd-4c62-9c16-0eecd9b67950","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T09:42:50.408969Z","strongest_claim":"Falcon-180B significantly outperforms models such as PaLM or Chinchilla, improves upon LLaMA 2 or Inflection-1, and nears the performance of 
PaLM-2-Large at a reduced pretraining and inference cost, making it one of the three best language models in the world along with GPT-4 and PaLM-2-Large.","one_line_summary":"Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that the reported benchmark results reflect genuine capability gains rather than differences in evaluation protocols, data contamination, or undisclosed advantages in testing conditions.","pith_extraction_headline":"Falcon-180B, trained on 3.5 trillion tokens from web data, nears PaLM-2-Large performance at lower pretraining and inference cost."},"references":{"count":269,"sample":[{"doi":"","year":null,"title":"Warp size impact in GPUs: large or small?","work_id":"ba0323a5-dcd1-4d44-8ace-ad2a3c972e0e","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization","work_id":"e4b3ee0a-80de-4f37-a4d9-bd3a21a97ed0","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Yiwei Yang, Chung Peng Lee, Shangbin Feng, Dora Zhao, Bingbing Wen, Anthony Zhe Liu, Yulia Tsvetkov, and Bill Howe","work_id":"22ad4cec-5465-4cdb-a2aa-ace82b84b5e9","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"The Power of Scale for Parameter-Efficient Prompt Tuning","work_id":"1056ba8e-7b3f-4811-be8e-9a3ed9269acb","ref_index":4,"cited_arxiv_id":"2104.08691","is_internal_anchor":true},{"doi":"","year":null,"title":"Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models","work_id":"8505729c-c88b-43bd-8e2c-2c94644ca438","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":269,"snapshot_sha256":"30ebd1fba5e2e0751becb6abca6595982cf1005f242a49e527c4c6b3923e8ac5","internal_anchors":66},"formal_canon":{"evidence_count":2,"snapshot_sha256":"62a45dd550d22bac9d8e8a6648af1e09895b937d9fe0a7ae1665e4013a052026"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}