{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:3WYX72X5SQT2EY5MTNNX24CJYY","short_pith_number":"pith:3WYX72X5","schema_version":"1.0","canonical_sha256":"ddb17feafd9427a263ac9b5b7d7049c62847350bb03b79d8d8bae0146738cb33","source":{"kind":"arxiv","id":"2404.19737","version":1},"attestation_state":"computed","paper":{"title":"Better & Faster Large Language Models via Multi-token Prediction","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Training language models to predict multiple future tokens improves coding performance and speeds up inference","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Badr Youbi Idrissi, Baptiste Rozi\\`ere, David Lopez-Paz, Fabian Gloeckle, Gabriel Synnaeve","submitted_at":"2024-04-30T17:33:57Z","abstract_excerpt":"Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. T"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2404.19737","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2024-04-30T17:33:57Z","cross_cats_sorted":[],"title_canon_sha256":"193f5844a8c69bc27a3c9495377ee7c2f62d7a994b0bf3c8691d4309735d0477","abstract_canon_sha256":"416350975cb8b534fbb7c32d3860af343ee9d2b1896c0f7273bab7be92c7a500"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:47.884803Z","signature_b64":"3sn5tgtNhew8T6iCO5Jo9BlYk4viikUEMLtAuK6BveXMPQU85DO6AOfL6w9pnahZt57BPU2tWq4LJw6x9vcqDQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"ddb17feafd9427a263ac9b5b7d7049c62847350bb03b79d8d8bae0146738cb33","last_reissued_at":"2026-05-17T23:38:47.884145Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:47.884145Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Better & Faster Large Language Models via Multi-token Prediction","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Training language models to predict multiple future tokens improves coding performance and speeds up inference","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Badr Youbi Idrissi, Baptiste Rozi\\`ere, David Lopez-Paz, Fabian Gloeckle, Gabriel Synnaeve","submitted_at":"2024-04-30T17:33:57Z","abstract_excerpt":"Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. T"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our 13B parameter models solves 12 % more problems on HumanEval and 17 % more on MBPP than comparable next-token models. ... models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the reported gains are caused by the multi-token auxiliary objective rather than differences in hyper-parameters, data ordering, or other uncontrolled training details, and that the benefit persists without degradation at much larger scales.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Multi-token prediction training yields higher sample efficiency, better benchmark scores on code generation, and up to 3x faster inference than standard next-token prediction for LLMs.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Training language models to predict multiple future tokens improves coding performance and speeds up inference","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"d4d49512641eba82c64914f3c3652ed359b9057ed485b987ed8e4889cf09bf72"},"source":{"id":"2404.19737","kind":"arxiv","version":1},"verdict":{"id":"ac4cbfaa-290a-4952-a366-e7a29b1d2974","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T12:21:13.563553Z","strongest_claim":"Our 13B parameter models solves 12 % more problems on HumanEval and 17 % more on MBPP than comparable next-token models. ... models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.","one_line_summary":"Multi-token prediction training yields higher sample efficiency, better benchmark scores on code generation, and up to 3x faster inference than standard next-token prediction for LLMs.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the reported gains are caused by the multi-token auxiliary objective rather than differences in hyper-parameters, data ordering, or other uncontrolled training details, and that the benefit persists without degradation at much larger scales.","pith_extraction_headline":"Training language models to predict multiple future tokens improves coding performance and speeds up inference"},"references":{"count":23,"sample":[{"doi":"","year":null,"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","ref_index":1,"cited_arxiv_id":"2108.07732","is_internal_anchor":true},{"doi":"","year":null,"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","ref_index":2,"cited_arxiv_id":"2107.03374","is_internal_anchor":true},{"doi":"","year":null,"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","ref_index":3,"cited_arxiv_id":"2110.14168","is_internal_anchor":true},{"doi":"","year":null,"title":"High Fidelity Neural Audio Compression","work_id":"bc645d2d-e9f2-4cb8-9a6d-bd557bc7a258","ref_index":4,"cited_arxiv_id":"2210.13438","is_internal_anchor":true},{"doi":"","year":2021,"title":"Leveraging parsbert and pretrained mt5 for persian abstractive text summarization","work_id":"0934b847-0d23-40fe-839c-92f6349edf54","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":23,"snapshot_sha256":"78c12a8dcc5e16eecc078d4d56bbcb3d6b88ea83927ec1efd81170549278a083","internal_anchors":5},"formal_canon":{"evidence_count":2,"snapshot_sha256":"0f693f76cd804dd6c5607a2418482231f9912992231dcd77e6e558e9e23cc471"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2404.19737","created_at":"2026-05-17T23:38:47.884256+00:00"},{"alias_kind":"arxiv_version","alias_value":"2404.19737v1","created_at":"2026-05-17T23:38:47.884256+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2404.19737","created_at":"2026-05-17T23:38:47.884256+00:00"},{"alias_kind":"pith_short_12","alias_value":"3WYX72X5SQT2","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"3WYX72X5SQT2EY5M","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"3WYX72X5","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":29,"internal_anchor_count":29,"sample":[{"citing_arxiv_id":"2605.12456","citing_title":"TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22422","citing_title":"FastTab: A Fast Table Recognizer with a Tiny Recursive Module and 1D Transformers","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15871","citing_title":"Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design","ref_index":114,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20104","citing_title":"Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16709","citing_title":"Covert Multi-bit LLM Watermarking: An Information Theory and Coding Approach","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2507.01449","citing_title":"LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2507.19247","citing_title":"A Markov Categorical Framework for Language Modeling","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2508.16745","citing_title":"Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2601.14671","citing_title":"Mirai: Autoregressive Visual Generation Needs Foresight","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2601.22925","citing_title":"BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2602.04289","citing_title":"Proxy Compression for Language Modeling","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2603.00110","citing_title":"Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2603.04791","citing_title":"Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2509.24527","citing_title":"Training Agents Inside of Scalable World Models","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14227","citing_title":"DT-Transformer: A Foundation Model for Disease Trajectory Prediction on a Real-world Health System","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26752","citing_title":"GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12460","citing_title":"Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12456","citing_title":"TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11577","citing_title":"BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2601.02780","citing_title":"MiMo-V2-Flash Technical Report","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26412","citing_title":"When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26752","citing_title":"GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26412","citing_title":"When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09630","citing_title":"Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25317","citing_title":"FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture","ref_index":17,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/3WYX72X5SQT2EY5MTNNX24CJYY","json":"https://pith.science/pith/3WYX72X5SQT2EY5MTNNX24CJYY.json","graph_json":"https://pith.science/api/pith-number/3WYX72X5SQT2EY5MTNNX24CJYY/graph.json","events_json":"https://pith.science/api/pith-number/3WYX72X5SQT2EY5MTNNX24CJYY/events.json","paper":"https://pith.science/paper/3WYX72X5"},"agent_actions":{"view_html":"https://pith.science/pith/3WYX72X5SQT2EY5MTNNX24CJYY","download_json":"https://pith.science/pith/3WYX72X5SQT2EY5MTNNX24CJYY.json","view_paper":"https://pith.science/paper/3WYX72X5","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2404.19737&json=true","fetch_graph":"https://pith.science/api/pith-number/3WYX72X5SQT2EY5MTNNX24CJYY/graph.json","fetch_events":"https://pith.science/api/pith-number/3WYX72X5SQT2EY5MTNNX24CJYY/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/3WYX72X5SQT2EY5MTNNX24CJYY/action/timestamp_anchor","attest_storage":"https://pith.science/pith/3WYX72X5SQT2EY5MTNNX24CJYY/action/storage_attestation","attest_author":"https://pith.science/pith/3WYX72X5SQT2EY5MTNNX24CJYY/action/author_attestation","sign_citation":"https://pith.science/pith/3WYX72X5SQT2EY5MTNNX24CJYY/action/citation_signature","submit_replication":"https://pith.science/pith/3WYX72X5SQT2EY5MTNNX24CJYY/action/replication_record"}},"created_at":"2026-05-17T23:38:47.884256+00:00","updated_at":"2026-05-17T23:38:47.884256+00:00"}