{"paper":{"title":"Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Griffin mixes gated linear recurrences with local attention to match Llama-2 performance on far fewer tokens.","cross_cats":["cs.CL"],"primary_cat":"cs.LG","authors_text":"Albert Gu, Aleksandar Botev, Anushan Fernando, Arnaud Doucet, Caglar Gulcehre, David Budden, George Cristian-Muraru, Guillaume Desjardins, Leonard Berrada, Nando de Freitas, Razvan Pascanu, Ruba Haroun, Samuel L. Smith, Soham De, Srivatsan Srinivasan, Yee Whye Teh, Yutian Chen","submitted_at":"2024-02-29T18:24:46Z","abstract_excerpt":"Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware effici"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the reported performance equivalence to Llama-2 generalizes beyond the specific benchmarks and training setup described, with no post-hoc data selection affecting the comparison.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Griffin mixes gated linear recurrences with local attention to match Llama-2 performance on far fewer tokens.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"453eeeab71e329b472c5055ed34a92779b56f915ac0e66b45e67483c37408c2c"},"source":{"id":"2402.19427","kind":"arxiv","version":1},"verdict":{"id":"191ea8e5-39d6-4cc5-9ed8-929dbf510981","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T06:53:29.822544Z","strongest_claim":"Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens.","one_line_summary":"Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the reported performance equivalence to Llama-2 generalizes beyond the specific benchmarks and training setup described, with no post-hoc data selection affecting the comparison.","pith_extraction_headline":"Griffin mixes gated linear recurrences with local attention to match Llama-2 performance on far fewer tokens."},"references":{"count":40,"sample":[{"doi":"","year":null,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":1,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":null,"title":"Neural Machine Translation by Jointly Learning to Align and Translate","work_id":"d831e763-d530-4029-a65c-ac595d82cb2a","ref_index":2,"cited_arxiv_id":"1409.0473","is_internal_anchor":true},{"doi":"","year":2004,"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","ref_index":3,"cited_arxiv_id":"2004.05150","is_internal_anchor":true},{"doi":"","year":null,"title":"Quasi-recurrent Neural Networks","work_id":"a3b3168b-5ab0-4a88-9e57-969e93933291","ref_index":4,"cited_arxiv_id":"1611.01576","is_internal_anchor":true},{"doi":"","year":1901,"title":"URLhttp://github.com/google/jax. T.Brown,B.Mann,N.Ryder,M.Subbiah,J.D.Kaplan,P.Dhariwal,A.Neelakantan,P.Shyam,G.Sastry, A. Askell, et al. Language models are few-shot learners. InAdvances in Neural In","work_id":"c7454809-842f-4823-bba3-768ab6263ede","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":40,"snapshot_sha256":"065d092f063504bc17e15c380070be97db7931f07b1762b98a1f6467824e6a2f","internal_anchors":25},"formal_canon":{"evidence_count":3,"snapshot_sha256":"9677d5ad37353891e8b6813eeb15655cc2cf0c7da05955a041567b569424ecf5"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}