{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2022:7FGC55WLCZKGVGCWA63UZW3XXX","short_pith_number":"pith:7FGC55WL","schema_version":"1.0","canonical_sha256":"f94c2ef6cb16546a985607b74cdb77bdecf4b252b7eda6367d020b0fe46c71c8","source":{"kind":"arxiv","id":"2211.17192","version":2},"attestation_state":"computed","paper":{"title":"Fast Inference from Transformers via Speculative Decoding","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Speculative decoding accelerates large autoregressive models by verifying multiple draft tokens in one parallel run of the target model while preserving the exact output distribution.","cross_cats":["cs.CL"],"primary_cat":"cs.LG","authors_text":"Matan Kalman, Yaniv Leviathan, Yossi Matias","submitted_at":"2022-11-30T17:33:28Z","abstract_excerpt":"Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel. At the heart of our approach lie the observations that (1) hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models, and (2) using speculative execution and a novel sampling method, we can make exact decoding from the large models faster"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2211.17192","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.LG","submitted_at":"2022-11-30T17:33:28Z","cross_cats_sorted":["cs.CL"],"title_canon_sha256":"09f8a8477b22a0ae2be2f682fdccd88bb0a06063c9f966e27e3edaf14d91ee52","abstract_canon_sha256":"722a11c4fbeb93e2a0d8e8109b2373e3064887a6df7d2970ce294692c809004e"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:12.750676Z","signature_b64":"EMCuHNopSWkxRgTrRcw1/v/btUNX/hWcGnig+gBQoiGd1+gqgXqu43+Yvq3SqA5W/xMmoHaIcyblFM4te6+NCA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"f94c2ef6cb16546a985607b74cdb77bdecf4b252b7eda6367d020b0fe46c71c8","last_reissued_at":"2026-05-17T23:38:12.749938Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:12.749938Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Fast Inference from Transformers via Speculative Decoding","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Speculative decoding accelerates large autoregressive models by verifying multiple draft tokens in one parallel run of the target model while preserving the exact output distribution.","cross_cats":["cs.CL"],"primary_cat":"cs.LG","authors_text":"Matan Kalman, Yaniv Leviathan, Yossi Matias","submitted_at":"2022-11-30T17:33:28Z","abstract_excerpt":"Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel. At the heart of our approach lie the observations that (1) hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models, and (2) using speculative execution and a novel sampling method, we can make exact decoding from the large models faster"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our method can accelerate existing off-the-shelf models without retraining or architecture changes. We demonstrate it on T5-XXL and show a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That sufficiently accurate and faster approximation models exist for the subtasks inside typical language-modeling workloads, so that the draft model produces enough accepted tokens to offset the overhead of the verification step.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Speculative decoding accelerates large autoregressive models by verifying multiple draft tokens in one parallel run of the target model while preserving the exact output distribution.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"e9495c274f62a12aede7daa942482095e524f8727dfa4a1e50a7aa7ea0484f7b"},"source":{"id":"2211.17192","kind":"arxiv","version":2},"verdict":{"id":"34fa2f82-a834-45fb-a491-aa45e950cc88","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T22:47:38.700405Z","strongest_claim":"Our method can accelerate existing off-the-shelf models without retraining or architecture changes. We demonstrate it on T5-XXL and show a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs.","one_line_summary":"Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That sufficiently accurate and faster approximation models exist for the subtasks inside typical language-modeling workloads, so that the draft model produces enough accepted tokens to offset the overhead of the verification step.","pith_extraction_headline":"Speculative decoding accelerates large autoregressive models by verifying multiple draft tokens in one parallel run of the target model while preserving the exact output distribution."},"references":{"count":67,"sample":[{"doi":"","year":2020,"title":"Brown, Tom B. and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarw","work_id":"cb344385-2776-46ff-84e8-d8811c8139c2","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"LaMDA: Language Models for Dialog Applications , author=. ArXiv , year=","work_id":"91238c16-51f7-41b5-9352-9084b7a7c73b","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Scaling Autoregressive Models for Content-Rich Text-to-Image Generation , author=. ArXiv , year=","work_id":"93496c82-9fc2-4a9b-84c9-216e7665e8de","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"PaLM: Scaling Language Modeling with Pathways , author=. ArXiv , year=","work_id":"ee1f4f00-3f86-4412-8473-bf8a673f0c59","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Lossless Speedup of Autoregressive Translation with Generalized Aggressive Decoding , author=. ArXiv , year=","work_id":"07a3c9d4-e172-4ff8-9e51-aa246367887c","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":67,"snapshot_sha256":"1557fb37f63b03ef71a21963633938c453fd1bba91a0b3e78c45b4d623164319","internal_anchors":9},"formal_canon":{"evidence_count":2,"snapshot_sha256":"da48a74510db87b4523392595a858c5901afebfb111283d400f78faf2bf5d6e6"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2211.17192","created_at":"2026-05-17T23:38:12.750054+00:00"},{"alias_kind":"arxiv_version","alias_value":"2211.17192v2","created_at":"2026-05-17T23:38:12.750054+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2211.17192","created_at":"2026-05-17T23:38:12.750054+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":16,"internal_anchor_count":16,"sample":[{"citing_arxiv_id":"2601.14053","citing_title":"LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems","ref_index":83,"is_internal_anchor":true},{"citing_arxiv_id":"2603.28049","citing_title":"Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12825","citing_title":"Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2401.10774","citing_title":"Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27476","citing_title":"EdgeFM: Efficient Edge Inference for Vision-Language Models","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09375","citing_title":"31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10453","citing_title":"SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02285","citing_title":"Complexity Horizons of Compressed Models in Analog Circuit Analysis","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12358","citing_title":"Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12110","citing_title":"SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15356","citing_title":"Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2305.13245","citing_title":"GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2604.05250","citing_title":"DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16957","citing_title":"Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17397","citing_title":"Speculative Decoding for Autoregressive Video Generation","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19642","citing_title":"Micro Language Models Enable Instant Responses","ref_index":5,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/7FGC55WLCZKGVGCWA63UZW3XXX","json":"https://pith.science/pith/7FGC55WLCZKGVGCWA63UZW3XXX.json","graph_json":"https://pith.science/api/pith-number/7FGC55WLCZKGVGCWA63UZW3XXX/graph.json","events_json":"https://pith.science/api/pith-number/7FGC55WLCZKGVGCWA63UZW3XXX/events.json","paper":"https://pith.science/paper/7FGC55WL"},"agent_actions":{"view_html":"https://pith.science/pith/7FGC55WLCZKGVGCWA63UZW3XXX","download_json":"https://pith.science/pith/7FGC55WLCZKGVGCWA63UZW3XXX.json","view_paper":"https://pith.science/paper/7FGC55WL","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2211.17192&json=true","fetch_graph":"https://pith.science/api/pith-number/7FGC55WLCZKGVGCWA63UZW3XXX/graph.json","fetch_events":"https://pith.science/api/pith-number/7FGC55WLCZKGVGCWA63UZW3XXX/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/7FGC55WLCZKGVGCWA63UZW3XXX/action/timestamp_anchor","attest_storage":"https://pith.science/pith/7FGC55WLCZKGVGCWA63UZW3XXX/action/storage_attestation","attest_author":"https://pith.science/pith/7FGC55WLCZKGVGCWA63UZW3XXX/action/author_attestation","sign_citation":"https://pith.science/pith/7FGC55WLCZKGVGCWA63UZW3XXX/action/citation_signature","submit_replication":"https://pith.science/pith/7FGC55WLCZKGVGCWA63UZW3XXX/action/replication_record"}},"created_at":"2026-05-17T23:38:12.750054+00:00","updated_at":"2026-05-17T23:38:12.750054+00:00"}