{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:7YXM6JRYOH4O5NBMCHKEPDPSW2","short_pith_number":"pith:7YXM6JRY","schema_version":"1.0","canonical_sha256":"fe2ecf263871f8eeb42c11d4478df2b69b77748d33f8d92acab2b44d81666059","source":{"kind":"arxiv","id":"2404.14294","version":3},"attestation_state":"computed","paper":{"title":"A Survey on Efficient Inference for Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A survey organizes methods for efficient large language model inference into data-level, model-level, and system-level categories and benchmarks representative techniques.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Guohao Dai, Jiaming Xu, Ke Hong, Luning Wang, Shengen Yan, Shiyao Li, Tianyu Fu, Xiao-Ping Zhang, Xiuhong Li, Xuefei Ning, Yuhan Dong, Yuming Lou, Yu Wang, Zhihang Yuan, Zixuan Zhou","submitted_at":"2024-04-22T15:53:08Z","abstract_excerpt":"Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Efforts within the field have been directed towards developing techniques aimed at enhancing the efficiency of LLM inference. This paper presents a comprehensive survey of the existing literature on efficient LLM inference. We start by analyzing the primary causes of the inefficient LLM inference, i.e., the large model size, the q"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2404.14294","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2024-04-22T15:53:08Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"0158e010d7858a65e7781dd03ec62b813bbae982fd020a8150281fd273403c03","abstract_canon_sha256":"7e45755716429abd0dc0e09cd3eff786a25f8857d55ccb1a7e23f2fa7d08b786"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:53.799025Z","signature_b64":"CyNTe/J2ssLwivPolec0iaoEw0/jBibZH7YlVEsxJzBzTkoqsiGhYAO19D3Zt4q82iU38Dy45hiqntBKvSSCAQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"fe2ecf263871f8eeb42c11d4478df2b69b77748d33f8d92acab2b44d81666059","last_reissued_at":"2026-05-17T23:38:53.798407Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:53.798407Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"A Survey on Efficient Inference for Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A survey organizes methods for efficient large language model inference into data-level, model-level, and system-level categories and benchmarks representative techniques.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Guohao Dai, Jiaming Xu, Ke Hong, Luning Wang, Shengen Yan, Shiyao Li, Tianyu Fu, Xiao-Ping Zhang, Xiuhong Li, Xuefei Ning, Yuhan Dong, Yuming Lou, Yu Wang, Zhihang Yuan, Zixuan Zhou","submitted_at":"2024-04-22T15:53:08Z","abstract_excerpt":"Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Efforts within the field have been directed towards developing techniques aimed at enhancing the efficiency of LLM inference. This paper presents a comprehensive survey of the existing literature on efficient LLM inference. We start by analyzing the primary causes of the inefficient LLM inference, i.e., the large model size, the q"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"This paper presents a comprehensive survey of the existing literature on efficient LLM inference... organized into data-level, model-level, and system-level optimization... with comparative experiments on representative methods.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the chosen representative methods and experimental comparisons fairly represent the broader literature and yield generalizable quantitative insights without significant selection bias.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A survey organizes methods for efficient large language model inference into data-level, model-level, and system-level categories and benchmarks representative techniques.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"db7eec77f23bcee3f3c597a53f5ac8216a6eacdb8a60e15dfd01e928160a1905"},"source":{"id":"2404.14294","kind":"arxiv","version":3},"verdict":{"id":"1ab38c5d-7fdb-4953-a163-f60d1ae7e089","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T02:36:18.313400Z","strongest_claim":"This paper presents a comprehensive survey of the existing literature on efficient LLM inference... organized into data-level, model-level, and system-level optimization... with comparative experiments on representative methods.","one_line_summary":"The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the chosen representative methods and experimental comparisons fairly represent the broader literature and yield generalizable quantitative insights without significant selection bias.","pith_extraction_headline":"A survey organizes methods for efficient large language model inference into data-level, model-level, and system-level categories and benchmarks representative techniques."},"references":{"count":298,"sample":[{"doi":"","year":2018,"title":"Improving language understanding by generative pre-training,","work_id":"a7a0f0e5-46ea-4c45-916e-10a354ef7a75","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Language models are unsupervised multitask learners","work_id":"9fb276fb-e836-4b02-aa1b-f31321e69d94","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1901,"title":"Language models are few-shot learners","work_id":"ba44e148-856c-498e-aded-be65cf943446","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"OPT: Open Pre-trained Transformer Language Models","work_id":"d7ff3b21-1fff-4cf4-952a-4714e3ef2307","ref_index":4,"cited_arxiv_id":"2205.01068","is_internal_anchor":true},{"doi":"","year":2023,"title":"Baichuan 2: Open large-scale language models","work_id":"9ba8f898-3900-4776-b82e-11e767a86ba9","ref_index":6,"cited_arxiv_id":"2309.10305","is_internal_anchor":false}],"resolved_work":298,"snapshot_sha256":"3371180055fbfce2246d8816adb0c736ac16d95c49f32ea8e91bc7b5961557a5","internal_anchors":41},"formal_canon":{"evidence_count":2,"snapshot_sha256":"c945af8a0d0aa36253f04d5fc6ccb3ba31d21c787614f8283ddfc3ef053a6a17"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2404.14294","created_at":"2026-05-17T23:38:53.798519+00:00"},{"alias_kind":"arxiv_version","alias_value":"2404.14294v3","created_at":"2026-05-17T23:38:53.798519+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2404.14294","created_at":"2026-05-17T23:38:53.798519+00:00"},{"alias_kind":"pith_short_12","alias_value":"7YXM6JRYOH4O","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"7YXM6JRYOH4O5NBM","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"7YXM6JRY","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":36,"internal_anchor_count":36,"sample":[{"citing_arxiv_id":"2503.14075","citing_title":"Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models","ref_index":70,"is_internal_anchor":true},{"citing_arxiv_id":"2505.02380","citing_title":"EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2505.13255","citing_title":"Policy Contrastive Decoding for Robotic Foundation Models","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2603.19199","citing_title":"FASTER: Rethinking Real-Time Flow VLAs","ref_index":111,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07035","citing_title":"Unified Deployment-Aware Evaluation of Open Reasoning Language Models","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20315","citing_title":"Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18597","citing_title":"Latent Action Reparameterization for Efficient Agent Inference","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16535","citing_title":"RAPT: Retrieval-Augmented Post-hoc Thresholding for Multi-Label Classification","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2506.12876","citing_title":"MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2511.06838","citing_title":"P3-LLM: An Integrated NPU-PIM Accelerator for Edge LLM Inference Using Hybrid Numerical Formats","ref_index":79,"is_internal_anchor":true},{"citing_arxiv_id":"2512.09427","citing_title":"ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2602.17697","citing_title":"Pimp My LLM: Leveraging Variability Modeling to Tune Inference Hyperparameters","ref_index":72,"is_internal_anchor":true},{"citing_arxiv_id":"2602.15889","citing_title":"Daily and Weekly Periodicity in Large Language Model Performance and Its Implications for Research","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2602.10144","citing_title":"When LLMs get significantly worse: A statistical approach to detect model degradations","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2603.19199","citing_title":"FASTER: Rethinking Real-Time Flow VLAs","ref_index":111,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19757","citing_title":"Transparent Screening for LLM Inference and Training Impacts","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2603.27112","citing_title":"RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03298","citing_title":"ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs","ref_index":64,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13734","citing_title":"KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02985","citing_title":"Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26209","citing_title":"Breaking the Autoregressive Chain: Hyper-Parallel Decoding for Efficient LLM-Based Attribute Value Extraction","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04738","citing_title":"OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19351","citing_title":"DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19167","citing_title":"LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation","ref_index":87,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10484","citing_title":"Strix: Re-thinking NPU Reliability from a System Perspective","ref_index":65,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/7YXM6JRYOH4O5NBMCHKEPDPSW2","json":"https://pith.science/pith/7YXM6JRYOH4O5NBMCHKEPDPSW2.json","graph_json":"https://pith.science/api/pith-number/7YXM6JRYOH4O5NBMCHKEPDPSW2/graph.json","events_json":"https://pith.science/api/pith-number/7YXM6JRYOH4O5NBMCHKEPDPSW2/events.json","paper":"https://pith.science/paper/7YXM6JRY"},"agent_actions":{"view_html":"https://pith.science/pith/7YXM6JRYOH4O5NBMCHKEPDPSW2","download_json":"https://pith.science/pith/7YXM6JRYOH4O5NBMCHKEPDPSW2.json","view_paper":"https://pith.science/paper/7YXM6JRY","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2404.14294&json=true","fetch_graph":"https://pith.science/api/pith-number/7YXM6JRYOH4O5NBMCHKEPDPSW2/graph.json","fetch_events":"https://pith.science/api/pith-number/7YXM6JRYOH4O5NBMCHKEPDPSW2/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/7YXM6JRYOH4O5NBMCHKEPDPSW2/action/timestamp_anchor","attest_storage":"https://pith.science/pith/7YXM6JRYOH4O5NBMCHKEPDPSW2/action/storage_attestation","attest_author":"https://pith.science/pith/7YXM6JRYOH4O5NBMCHKEPDPSW2/action/author_attestation","sign_citation":"https://pith.science/pith/7YXM6JRYOH4O5NBMCHKEPDPSW2/action/citation_signature","submit_replication":"https://pith.science/pith/7YXM6JRYOH4O5NBMCHKEPDPSW2/action/replication_record"}},"created_at":"2026-05-17T23:38:53.798519+00:00","updated_at":"2026-05-17T23:38:53.798519+00:00"}