arXiv: 2404.14294 · v3 · submitted 2024-04-22 · cs.CL · cs.AI

A Survey on Efficient Inference for Large Language Models

Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang

Pith reviewed 2026-05-15 02:36 UTC · model grok-4.3

keywords: LLM inference efficiency · survey · model quantization · attention optimization · system-level serving · comparative benchmarks · large language models

The pith

A survey organizes methods for efficient large language model inference into data-level, model-level, and system-level categories and benchmarks representative techniques.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies three root causes of slow and memory-heavy LLM inference: oversized models, quadratic attention, and token-by-token generation. It groups the literature into a three-part taxonomy and runs side-by-side experiments on key methods to quantify speed and memory trade-offs. A reader gains a practical map for selecting or combining optimizations when deploying models under hardware limits.
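The three root causes named above can be put into rough numbers. The following is a minimal back-of-envelope sketch, not a calculation taken from the paper. The function names and the 7B-class shapes (32 layers, hidden size 4096, FP16 storage, standard multi-head attention without grouped-query sharing) are assumptions chosen for illustration.

```python
# Back-of-envelope cost model for a decoder-only transformer.
# All shapes below are illustrative assumptions, not figures from the survey.

def weight_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Memory needed to hold the weights (FP16 uses 2 bytes per parameter)."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, hidden: int, seq_len: int,
                   batch: int = 1, bytes_per_value: int = 2) -> int:
    """KV cache size: one key and one value vector per token, per layer."""
    return 2 * n_layers * hidden * seq_len * batch * bytes_per_value


def attention_score_flops(n_layers: int, hidden: int, seq_len: int) -> int:
    """FLOPs for QK^T plus (scores @ V) over a full prompt.

    Each of the two matmuls costs about 2 * seq_len^2 * hidden FLOPs per
    layer (one multiply-add counted as 2 FLOPs), hence the quadratic term.
    """
    return n_layers * 2 * (2 * seq_len * seq_len * hidden)


def sequential_decode_steps(n_new_tokens: int) -> int:
    """Auto-regressive decoding: one dependent forward pass per new token."""
    return n_new_tokens


# Cause 1: model size.
print("weights (GB):", weight_bytes(7_000_000_000) / 1e9)

# Cause 2: attention cost grows quadratically with sequence length.
short = attention_score_flops(32, 4096, 2048)
long = attention_score_flops(32, 4096, 4096)
print("attention FLOPs ratio for 2x longer prompt:", long / short)

# Cause 3: decoding is sequential, and the KV cache grows with every token.
print("KV cache (GiB) at 2048 tokens:", kv_cache_bytes(32, 4096, 2048) / 2**30)
print("forward passes for 256 new tokens:", sequential_decode_steps(256))
```

The sketch ignores the linear projections and feed-forward layers, which dominate compute at short sequence lengths, so it only isolates the terms the summary points at.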

Core claim

The central claim is that existing efficiency techniques can be systematically classified into data-level (input compression, output organization), model-level (efficient structure design, plus model compression such as quantization, pruning, and distillation), and system-level (inference engine and serving system) optimizations, with comparative experiments on representative methods quantifying latency and memory trade-offs within these categories.

What carries the argument

The three-tier taxonomy of data-level, model-level, and system-level optimization, which structures the surveyed methods and supports direct quantitative comparison of their efficiency gains.

If this is right

Quantization, which the survey itself files under model-level compression rather than data-level optimization, reduces memory usage while preserving most accuracy.
Model-level changes like sparse attention cut the quadratic cost of self-attention.
System-level improvements raise throughput in multi-user serving without altering the model.
Hybrid combinations across levels produce larger gains than isolated techniques.
The taxonomy supplies a basis for future automated selection of optimization stacks.
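To make the quantization item in the list above concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration under stated assumptions, not a method benchmarked in the survey: the helper names and the sample weights are hypothetical, and production schemes typically use per-channel or per-group scales.

```python
# Symmetric per-tensor INT8 quantization (illustrative sketch).
# Function names and the sample weights are hypothetical.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


weights = [0.12, -0.98, 0.45, 0.003, -0.27]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Rounding to the nearest step bounds the error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", q)
print("max abs error:", max_error, "<= half step:", scale / 2)
```

Storing one byte per weight instead of two halves weight memory relative to FP16. The accuracy cost depends on how well a single scale fits the weight distribution, which is why outliers matter so much in practice.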

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The quantitative comparisons could inform hardware-aware selection rules for edge versus cloud deployment.
The same three-level structure may extend to efficient training or multimodal inference pipelines.
Researchers could test whether the taxonomy remains stable when applied to mixture-of-experts or state-space models.

Load-bearing premise

The chosen representative methods and experimental setups fairly capture performance differences across the broader literature without significant selection bias.

What would settle it

Two findings would settle it: a widely used technique that fits none of the three categories would refute the taxonomy's completeness, and a controlled replication in which the reported speedups disappear on models larger than those tested would refute the generalizability of the experimental results.

Original abstract

Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Efforts within the field have been directed towards developing techniques aimed at enhancing the efficiency of LLM inference. This paper presents a comprehensive survey of the existing literature on efficient LLM inference. We start by analyzing the primary causes of the inefficient LLM inference, i.e., the large model size, the quadratic-complexity attention operation, and the auto-regressive decoding approach. Then, we introduce a comprehensive taxonomy that organizes the current literature into data-level, model-level, and system-level optimization. Moreover, the paper includes comparative experiments on representative methods within critical sub-fields to provide quantitative insights. Last but not least, we provide some knowledge summary and discuss future research directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This survey organizes LLM inference efficiency work into a sensible taxonomy and adds some comparative experiments, but the selection process for those experiments is not explained clearly enough.

This paper is a survey that organizes existing techniques for faster and lighter LLM inference. It starts with the main sources of inefficiency—large model sizes, quadratic attention, and autoregressive decoding—then groups the fixes into data-level, model-level, and system-level categories. The taxonomy is straightforward and groups related ideas without stretching. Adding side-by-side experiments on representative methods from key sub-areas gives readers some concrete numbers on trade-offs, which is more helpful than a pure literature list.

Referee Report

1 major / 1 minor

Summary. This paper surveys techniques for efficient inference in Large Language Models. It identifies primary causes of inefficiency (large model size, quadratic attention complexity, and autoregressive decoding), organizes the literature via a taxonomy into data-level, model-level, and system-level optimizations, presents comparative experiments on representative methods in key sub-fields to supply quantitative insights, and discusses future directions.

Significance. If the experimental comparisons hold, the survey offers a useful organizing framework for a fast-growing area and supplies concrete quantitative benchmarks that can inform deployment decisions. The taxonomy and experiments together provide more actionable guidance than a purely descriptive review.

major comments (1)

[Comparative Experiments] Section describing the comparative experiments: the manuscript states that experiments were run on 'representative methods' but supplies no explicit, reproducible selection protocol (citation thresholds, recency cutoffs, implementation availability, or hardware filters). Without such criteria the reported speed/accuracy trade-offs cannot be shown to be free of selection bias and therefore do not reliably generalize to the full literature covered by the taxonomy.

minor comments (1)

[Abstract] Abstract: the phrase 'comparative experiments on representative methods' would be clearer if it named the primary metrics (e.g., latency, throughput, memory) and the number of methods compared.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. The feedback on the comparative experiments section is well-taken, and we have revised the manuscript to include an explicit, reproducible selection protocol. This strengthens the transparency and generalizability of the quantitative results.

Referee: Section describing the comparative experiments: the manuscript states that experiments were run on 'representative methods' but supplies no explicit, reproducible selection protocol (citation thresholds, recency cutoffs, implementation availability, or hardware filters). Without such criteria the reported speed/accuracy trade-offs cannot be shown to be free of selection bias and therefore do not reliably generalize to the full literature covered by the taxonomy.

Authors: We agree that the original manuscript lacked a clear selection protocol, which limits reproducibility. In the revised version, we have added a dedicated subsection (now Section 4.1) that explicitly defines the criteria used: (1) methods with publicly available open-source implementations at the time of writing, (2) publications in top-tier venues (NeurIPS, ICML, ICLR, ACL, EMNLP) from 2022 onward, (3) coverage of at least one representative technique per major sub-category in the taxonomy, and (4) evaluation on consistent hardware (A100 GPUs) and model backbones (Llama-7B/13B). We also include a new table (Table 1) listing all selected methods with their original citations and implementation links. These additions directly address potential selection bias and allow readers to replicate or extend the comparisons.

Revision: yes
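The four criteria in the simulated response could be expressed as a small, reproducible filter. The sketch below is hypothetical: the `Method` record, the candidate entries (`method_a` to `method_d`), and the sub-category labels are invented for illustration and do not come from the paper or its Table 1.

```python
from dataclasses import dataclass

# Criterion 2: venue whitelist and recency cutoff, as listed in the response.
TOP_VENUES = {"NeurIPS", "ICML", "ICLR", "ACL", "EMNLP"}
MIN_YEAR = 2022

# Criterion 4 is a fixed evaluation setup rather than a per-method filter.
EVAL_SETUP = {"hardware": "A100", "backbones": ["Llama-7B", "Llama-13B"]}


@dataclass(frozen=True)
class Method:
    """Hypothetical record describing one candidate method."""
    name: str
    venue: str
    year: int
    has_open_source: bool
    sub_category: str


def select_methods(candidates):
    """Apply criteria 1 and 2: open-source code, top venue, recent enough."""
    return [
        m for m in candidates
        if m.has_open_source and m.venue in TOP_VENUES and m.year >= MIN_YEAR
    ]


def uncovered_sub_categories(selected, required):
    """Criterion 3: report required sub-categories with no selected method."""
    covered = {m.sub_category for m in selected}
    return sorted(set(required) - covered)


# Invented candidates, used only to exercise the filter.
candidates = [
    Method("method_a", "ICML", 2023, True, "quantization"),
    Method("method_b", "NeurIPS", 2021, True, "sparse attention"),
    Method("method_c", "ICLR", 2024, True, "sparse attention"),
    Method("method_d", "EMNLP", 2023, False, "serving"),
]

selected = select_methods(candidates)
missing = uncovered_sub_categories(
    selected, ["quantization", "sparse attention", "serving"]
)
print("selected:", [m.name for m in selected])
print("sub-categories still uncovered:", missing)
```

Writing the protocol this way makes one consequence visible: a strict venue whitelist can leave a sub-category empty, so criterion 3 may conflict with criterion 2 and would need an explicit tie-breaking rule.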

Circularity Check

0 steps flagged

No circularity: survey organizes external literature without self-referential derivations

This is a survey paper that summarizes and taxonomizes existing work on efficient LLM inference into data-level, model-level, and system-level categories, with comparative experiments on representative methods drawn from the broader literature. No original derivations, equations, fitted parameters, or predictions are presented that could reduce to self-defined inputs by construction. All claims reference external citations, and the taxonomy serves as an organizational framework rather than a derived result. The selection of representative methods for experiments does not constitute circularity under the defined patterns, as it involves no self-definition, fitted-input renaming, or load-bearing self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a literature survey containing no new mathematical derivations, fitted parameters, or postulated entities; all content draws from previously published work.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
cs.DC 2026-05 conditional novelty 7.0

KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
Breaking the Autoregressive Chain: Hyper-Parallel Decoding for Efficient LLM-Based Attribute Value Extraction
cs.CL 2026-04 unverdicted novelty 7.0

Hyper-Parallel Decoding enables parallel generation of independent sequences in LLMs via position ID manipulation, delivering up to 13.8X speedup for attribute value extraction.
Choose, Don't Label: Multiple-Choice Query Synthesis for Program Disambiguation
cs.PL 2026-04 unverdicted novelty 7.0

Multiple-choice queries synthesized from Hoare triples enable more reliable identification of intended programs than labeled-example supervision in active learning for program disambiguation.
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
cs.CL 2026-04 unverdicted novelty 7.0

TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
cs.AR 2026-03 unverdicted novelty 7.0

ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation
cs.CV 2026-03 unverdicted novelty 7.0

RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
cs.LG 2026-05 unverdicted novelty 6.0

OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.
OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
cs.LG 2026-05 unverdicted novelty 6.0

OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces
cs.AI 2026-05 unverdicted novelty 6.0

JACTUS unifies low-rank compression and task adaptation via a task-aware union of subspaces and global rank allocation by marginal gain, outperforming 100% PEFT methods like DoRA on ViT-Base (89.2% avg) and Llama2-7B ...
Joint Architecture-Token-Bitwidth Multi-Axis Optimization of Vision Transformers for Semiconductor IC Packaging
cs.CV 2026-05 unverdicted novelty 6.0

A joint architecture-token-bitwidth optimization of Vision Transformers delivers over 10x gains in throughput, parameters, FLOPs and energy on a semiconductor defect classification task while preserving required accuracy.
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
cs.CL 2026-04 unverdicted novelty 6.0

DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.
LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
cs.LG 2026-04 unverdicted novelty 6.0

LBLLM achieves better accuracy than prior binarization methods for LLMs by decoupling weight and activation quantization through initialization, layer-wise distillation, and learnable activation scaling.
Strix: Re-thinking NPU Reliability from a System Perspective
cs.AR 2026-04 unverdicted novelty 6.0

Strix delivers sub-microsecond fault localisation, detection, and correction on NPUs with 1.04x slowdown and minimal hardware cost by system-level re-partitioning and targeted safeguards.
MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
cs.LG 2026-04 unverdicted novelty 6.0

MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
Paper Espresso: From Paper Overload to Research Insight
cs.DL 2026-04 unverdicted novelty 6.0

Paper Espresso deploys LLMs to summarize and analyze trends across 13,300+ arXiv papers over 35 months, releasing metadata that shows non-saturating topic growth and higher engagement for novel topics.
Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference
cs.IR 2026-04 conditional novelty 6.0

LLMLingua prompt compression yields up to 18% end-to-end LLM speedups with unchanged quality when prompt length, ratio, and hardware align, plus an open profiler to predict the break-even point.
FASTER: Rethinking Real-Time Flow VLAs
cs.RO 2026-03 conditional novelty 6.0

FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.
Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs
cs.CV 2026-04 unverdicted novelty 5.0

A data-driven adaptive policy for KV-cache bit-width selection based on token importance features reduces decoding latency by ~18% and improves accuracy over static quantization while staying near FP16 levels on SmolL...
Transparent Screening for LLM Inference and Training Impacts
cs.LG 2026-03 unverdicted novelty 5.0

The paper proposes a transparent proxy framework for estimating LLM inference and training environmental impacts from natural-language application descriptions.
Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
cs.CL 2026-04 accept novelty 4.0

Gemma-4-E4B with few-shot chain-of-thought reaches the highest weighted accuracy of 0.675 at 14.9 GB VRAM, while the larger Gemma-4-26B-A4B MoE model scores 0.663 but uses 48.1 GB.
Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
cs.CL 2026-04 unverdicted novelty 4.0

A survey that taxonomizes efficiency methods for LVLMs across the full inference pipeline, decouples the problem into information density, long-context attention, and memory limits, and outlines four future research f...
Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda
cs.DC 2026-04 unverdicted novelty 2.0

This research agenda argues that cloud-native architectures, microservices, autoscaling, and emerging trends like serverless inference and federated learning are required to make large language models efficient and scalable.

Reference graph

Works this paper leans on

298 extracted references · 298 canonical work pages · cited by 21 Pith papers · 35 internal anchors

[1]

Improving language understanding by generative pre-training,

A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al. , “Improving language understanding by generative pre-training,” 2018

work page 2018
[2]

Language models are unsupervised multitask learners,

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al. , “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019

work page 2019
[3]

Language models are few-shot learners,

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P . Dhari- wal, A. Neelakantan, P . Shyam, G. Sastry, A. Askell et al. , “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

work page 1901
[4]

OPT: Open Pre-trained Transformer Language Models

S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V . Lin et al., “Opt: Open pre-trained transformer language models,” arXiv preprint arXiv:2205.01068 , 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

CoRR , volume =

A. Yang, B. Xiao, B. Wang, B. Zhang, C. Bian, C. Yin, C. Lv, D. Pan, D. Wang, D. Yan et al., “Baichuan 2: Open large-scale language models,” arXiv preprint arXiv:2309.10305, 2023

work page arXiv 2023
[7]

Vicuna: An open- source chatbot impressing gpt-4 with 90%* chatgpt quality,

W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez et al. , “Vicuna: An open- source chatbot impressing gpt-4 with 90%* chatgpt quality,” See https://vicuna. lmsys. org (accessed 14 April 2023) , 2023

work page 2023
[8]

How long can context length of open- source llms truly promise?

D. Li, R. Shao, A. Xie, Y. Sheng, L. Zheng, J. Gonzalez, I. Stoica, X. Ma, and H. Zhang, “How long can context length of open- source llms truly promise?” in NeurIPS 2023 Workshop on Instruc- tion Tuning and Instruction Following, 2023

work page 2023
[9]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

B. Workshop, T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ili ´c, D. Hesslow, R. Castagn´e, A. S. Luccioni, F. Yvon et al., “Bloom: A 176b-parameter open-access multilingual language model,”arXiv preprint arXiv:2211.05100, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[10]

The Falcon Series of Open Language Models , journal =

E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojo- caru, M. Debbah, ´E. Goffinet, D. Hesslow, J. Launay, Q. Malartic et al., “The falcon series of open language models,” arXiv preprint arXiv:2311.16867, 2023

work page arXiv 2023
[11]

Glm: General language model pretraining with autoregressive blank infilling

Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang, “Glm: General language model pretraining with autoregressive blank infilling,” arXiv preprint arXiv:2103.10360, 2021

work page arXiv 2021
[12]

Mixtral of Experts

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Harnessing the power of llms in practice: A survey on chatgpt and beyond,

J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, and X. Hu, “Harnessing the power of llms in practice: A survey on chatgpt and beyond,” ACM Transactions on Knowledge Discovery from Data, 2023

work page 2023
[14]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24 824–24 837, 2022

work page 2022
[15]

Evaluating Large Language Models Trained on Code

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P . d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al. , “Eval- uating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[16]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

S. Bubeck, V . Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P . Lee, Y. T. Lee, Y. Li, S. Lundberg et al. , “Sparks of artificial general intelligence: Early experiments with gpt-4,” arXiv preprint arXiv:2303.12712, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

A survey on model compression for large language models

X. Zhu, J. Li, Y. Liu, C. Ma, and W. Wang, “A survey on model compression for large language models,” arXiv preprint arXiv:2308.07633, 2023

work page arXiv 2023
[18]

A comprehensive survey of compression algorithms for language models

S. Park, J. Choi, S. Lee, and U. Kang, “A comprehensive survey of compression algorithms for language models,” arXiv preprint arXiv:2401.15347, 2024

work page arXiv 2024
[19]

Model compression and efficient inference for large language models: A survey

W. Wang, W. Chen, Y. Luo, Y. Long, Z. Lin, L. Zhang, B. Lin, D. Cai, and X. He, “Model compression and efficient infer- ence for large language models: A survey,” arXiv preprint arXiv:2402.09748, 2024

work page arXiv 2024
[20]

A survey on transformer compression,

Y. Tang, Y. Wang, J. Guo, Z. Tu, K. Han, H. Hu, and D. Tao, “A survey on transformer compression,” arXiv preprint arXiv:2402.05964, 2024

work page arXiv 2024
[21]

The efficiency spectrum of large language models: An algorithmic survey,

T. Ding, T. Chen, H. Zhu, J. Jiang, Y. Zhong, J. Zhou, G. Wang, Z. Zhu, I. Zharkov, and L. Liang, “The efficiency spectrum of large language models: An algorithmic survey,” arXiv preprint arXiv:2312.00678, 2023

work page arXiv 2023
[22]

Towards efficient generative large language model serving: A survey from algorithms to systems

X. Miao, G. Oliaro, Z. Zhang, X. Cheng, H. Jin, T. Chen, and Z. Jia, “Towards efficient generative large language model serving: A survey from algorithms to systems,” arXiv preprint arXiv:2312.15234, 2023

work page arXiv 2023
[23]

Efficient large language models: A survey

Z. Wan, X. Wang, C. Liu, S. Alam, Y. Zheng, Z. Qu, S. Yan, Y. Zhu, Q. Zhang, M. Chowdhury et al., “Efficient large language models: A survey,” arXiv preprint arXiv:2312.03863, vol. 1, 2023

work page arXiv 2023
[24]

A survey of resource-efficient llm and multimodal foundation models,

M. Xu, W. Yin, D. Cai, R. Yi, D. Xu, Q. Wang, B. Wu, Y. Zhao, C. Yang, S. Wang et al. , “A survey of resource-efficient llm and multimodal foundation models,” arXiv preprint arXiv:2401.08092, 2024

work page arXiv 2024
[25]

A Survey of Large Language Models

W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems , vol. 30, 2017

work page 2017
[27]

Llm inference unveiled: Survey and roofline model insights,

Z. Yuan, Y. Shang, Y. Zhou, Z. Dong, C. Xue, B. Wu, Z. Li, Q. Gu, Y. J. Lee, Y. Yanet al., “Llm inference unveiled: Survey and roofline model insights,” arXiv preprint arXiv:2402.16363, 2024

work page arXiv 2024
[28]

Is flash attention stable?

A. Golden, S. Hsia, F. Sun, B. Acun, B. Hosmer, Y. Lee, Z. DeVito, J. Johnson, G.-Y. Wei, D. Brooks et al., “Is flash attention stable?” arXiv preprint arXiv:2405.02803, 2024

work page arXiv 2024
[29]

Retrieval- augmented generation for knowledge-intensive nlp tasks,

P . Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschel et al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,” Ad- vances in Neural Information Processing Systems , vol. 33, pp. 9459– 9474, 2020

work page 2020
[30]

& Chen, D

A. Chevalier, A. Wettig, A. Ajith, and D. Chen, “Adapt- ing language models to compress contexts,” arXiv preprint arXiv:2305.14788, 2023

work page arXiv 2023
[31]

Replug: Retrieval-augmented black-box language models,

W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, and W. tau Yih, “Replug: Retrieval-augmented black-box language models,” 2023

work page 2023
[32]

Self- rag: Learning to retrieve, generate, and critique through self- reflection,

A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self- rag: Learning to retrieve, generate, and critique through self- reflection,” 2023

work page 2023
[33]

Prompt compres- sion and contrastive conditioning for controllability and toxicity reduction in language models,

D. Wingate, M. Shoeybi, and T. Sorensen, “Prompt compres- sion and contrastive conditioning for controllability and toxicity reduction in language models,” arXiv preprint arXiv:2210.03162 , 2022

work page arXiv 2022
[34]

Learning to compress prompts with gist tokens,

J. Mu, X. L. Li, and N. Goodman, “Learning to compress prompts with gist tokens,” arXiv preprint arXiv:2304.08467, 2023. 30

work page arXiv 2023
[35]

In-context autoencoder for context compression in a large language model,

T. Ge, J. Hu, X. Wang, S.-Q. Chen, and F. Wei, “In-context autoencoder for context compression in a large language model,” arXiv preprint arXiv:2307.06945, 2023

work page arXiv 2023
[36]

Recomp: Improving retrieval- augmented lms with compression and selective augmentation,

F. Xu, W. Shi, and E. Choi, “Recomp: Improving retrieval- augmented lms with compression and selective augmentation,” arXiv preprint arXiv:2310.04408, 2023

work page arXiv 2023
[37]

Ex- tending context window of large language models via semantic compression,

W. Fei, X. Niu, P . Zhou, L. Hou, B. Bai, L. Deng, and W. Han, “Ex- tending context window of large language models via semantic compression,” arXiv preprint arXiv:2312.09571, 2023

work page arXiv 2023
[38]

Efficient prompting via dynamic in-context learning,

W. Zhou, Y. E. Jiang, R. Cotterell, and M. Sachan, “Efficient prompting via dynamic in-context learning,” arXiv preprint arXiv:2305.11170, 2023

work page arXiv 2023
[39]

Compressing context to enhance inference efficiency of large language models,

Y. Li, B. Dong, F. Guerin, and C. Lin, “Compressing context to enhance inference efficiency of large language models,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 6342–6353

work page 2023
[40]

Did you read the instructions? rethinking the effectiveness of task defi- nitions in instruction learning,

F. Yin, J. Vig, P . Laban, S. Joty, C. Xiong, and C.-S. J. Wu, “Did you read the instructions? rethinking the effectiveness of task defi- nitions in instruction learning,” arXiv preprint arXiv:2306.01150 , 2023

work page arXiv 2023
[41]

Discrete prompt compression with reinforcement learning,

H. Jung and K.-J. Kim, “Discrete prompt compression with reinforcement learning,” arXiv preprint arXiv:2308.08758, 2023

work page arXiv 2023
[42]

Llmlingua: Compressing prompts for accelerated inference of large language models,

H. Jiang, Q. Wu, C.-Y. Lin, Y. Yang, and L. Qiu, “Llmlingua: Compressing prompts for accelerated inference of large language models,” in The 2023 Conference on Empirical Methods in Natural Language Processing, 2023

work page 2023
[43]

Jiang, Q

H. Jiang, Q. Wu, X. Luo, D. Li, C.-Y. Lin, Y. Yang, and L. Qiu, “Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression,” arXiv preprint arXiv:2310.06839, 2023

work page arXiv 2023
[44]

Boosting llm reasoning: Push the limits of few-shot learning with reinforced in-context pruning,

X. Huang, L. L. Zhang, K.-T. Cheng, and M. Yang, “Boosting llm reasoning: Push the limits of few-shot learning with reinforced in-context pruning,” arXiv preprint arXiv:2312.08901, 2023

work page arXiv 2023
[45]

A Survey on In-context Learning

Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, and Z. Sui, “A survey for in-context learning,” arXiv preprint arXiv:2301.00234, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[46]

Prefix-tuning: Optimizing continuous prompts for generation,

X. L. Li and P . Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Inter- national Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 4582–4597

work page 2021
[47]

Skeleton-of-thought: Prompting llms for efficient parallel generation.arXiv preprint arXiv:2307.15337, 2023

X. Ning, Z. Lin, Z. Zhou, H. Yang, and Y. Wang, “Skeleton-of- thought: Large language models can do parallel decoding,” arXiv preprint arXiv:2307.15337, 2023

work page arXiv 2023
[48]

Adaptive skeleton graph decoding,

S. Jin, Y. Wu, H. Zheng, Q. Zhang, M. Lentz, Z. M. Mao, A. Prakash, F. Qian, and D. Zhuo, “Adaptive skeleton graph decoding,” arXiv preprint arXiv:2402.12280, 2024

work page arXiv 2024
[49]

Apar: Llms can do auto-parallel auto-regressive decoding,

M. Liu, A. Zeng, B. Wang, P . Zhang, J. Tang, and Y. Dong, “Apar: Llms can do auto-parallel auto-regressive decoding,” arXiv preprint arXiv:2401.06761, 2024

work page arXiv 2024
[50]

Medusa: Simple llm inference acceleration framework with mul- tiple decoding heads,

T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and T. Dao, “Medusa: Simple llm inference acceleration framework with mul- tiple decoding heads,” 2024

work page 2024
[51]

Efficient memory manage- ment for large language model serving with pagedattention,

W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory manage- ment for large language model serving with pagedattention,” in Proceedings of the 29th Symposium on Operating Systems Principles , 2023, pp. 611–626

work page 2023
[52]

SGLang: Efficient Execution of Structured Language Model Programs

L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez et al. , “Efficiently pro- gramming large language models using sglang,” arXiv preprint arXiv:2312.07104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Tree of thoughts: Deliberate problem solving with large language models,

S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” Advances in Neural Information Processing Systems, vol. 36, 2024

[54]

Graph of thoughts: Solving elaborate problems with large language models

M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk et al., “Graph of thoughts: Solving elaborate problems with large language models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 17 682–17 690

[55]

The Rise and Potential of Large Language Model Based Agents: A Survey

Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou et al., “The rise and potential of large language model based agents: A survey,” arXiv preprint arXiv:2309.07864, 2023

[56]

Corex: Pushing the boundaries of complex reasoning through multi-model collaboration

Q. Sun, Z. Yin, X. Li, Z. Wu, X. Qiu, and L. Kong, “Corex: Pushing the boundaries of complex reasoning through multi-model collaboration,” arXiv preprint arXiv:2310.00280, 2023

[57]

Large Language Model based Multi-Agents: A Survey of Progress and Challenges

T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,” arXiv preprint arXiv:2402.01680, 2024

[58]

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

L. Chen, M. Zaharia, and J. Zou, “Frugalgpt: How to use large language models while reducing cost and improving performance,” arXiv preprint arXiv:2305.05176, 2023

[59]

What makes convolutional models great on long sequence modeling?

Y. Li, T. Cai, Y. Zhang, D. Chen, and D. Dey, “What makes convolutional models great on long sequence modeling?” arXiv preprint arXiv:2210.09298, 2022

[60]

Ckconv: Continuous kernel convolution for sequential data

D. W. Romero, A. Kuzina, E. J. Bekkers, J. M. Tomczak, and M. Hoogendoorn, “Ckconv: Continuous kernel convolution for sequential data,” arXiv preprint arXiv:2102.02611, 2021

[61]

Hyena hierarchy: Towards larger convolutional language models

M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao, S. Baccus, Y. Bengio, S. Ermon, and C. Ré, “Hyena hierarchy: Towards larger convolutional language models,” in International Conference on Machine Learning. PMLR, 2023, pp. 28 043–28 078

[62]

RWKV: Reinventing RNNs for the Transformer Era

B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, H. Cao, X. Cheng, M. Chung, M. Grella, K. K. GV et al., “Rwkv: Reinventing rnns for the transformer era,” arXiv preprint arXiv:2305.13048, 2023

[63]

Retentive Network: A Successor to Transformer for Large Language Models

Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and F. Wei, “Retentive network: A successor to transformer for large language models,” arXiv preprint arXiv:2307.08621, 2023

[64]

Hippo: Recurrent memory with optimal polynomial projections

A. Gu, T. Dao, S. Ermon, A. Rudra, and C. Ré, “Hippo: Recurrent memory with optimal polynomial projections,” Advances in neural information processing systems, vol. 33, pp. 1474–1487, 2020

[65]

Combining recurrent, convolutional, and continuous-time models with linear state space layers

A. Gu, I. Johnson, K. Goel, K. Saab, T. Dao, A. Rudra, and C. Ré, “Combining recurrent, convolutional, and continuous-time models with linear state space layers,” Advances in neural information processing systems, vol. 34, pp. 572–585, 2021

[66]

Efficiently Modeling Long Sequences with Structured State Spaces

A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” arXiv preprint arXiv:2111.00396, 2021

[67]

Diagonal state spaces are as effective as structured state spaces

A. Gupta, A. Gu, and J. Berant, “Diagonal state spaces are as effective as structured state spaces,” Advances in Neural Information Processing Systems, vol. 35, pp. 22 982–22 994, 2022

[68]

On the parameterization and initialization of diagonal state space models

A. Gu, K. Goel, A. Gupta, and C. Ré, “On the parameterization and initialization of diagonal state space models,” Advances in Neural Information Processing Systems, vol. 35, pp. 35 971–35 983, 2022

[69]

Long range language modeling via gated state spaces

H. Mehta, A. Gupta, A. Cutkosky, and B. Neyshabur, “Long range language modeling via gated state spaces,” in International Conference on Learning Representations, 2023

[70]

Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052

D. Y. Fu, T. Dao, K. K. Saab, A. W. Thomas, A. Rudra, and C. Ré, “Hungry hungry hippos: Towards language modeling with state space models,” arXiv preprint arXiv:2212.14052, 2022

[71]

Liquid structural state-space models

R. Hasani, M. Lechner, T.-H. Wang, M. Chahine, A. Amini, and D. Rus, “Liquid structural state-space models,” arXiv preprint arXiv:2209.12951, 2022

[72]

Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022

J. T. Smith, A. Warrington, and S. W. Linderman, “Simplified state space layers for sequence modeling,” arXiv preprint arXiv:2208.04933, 2022

[73]

Block-state transformers

J. Pilault, M. Fathi, O. Firat, C. Pal, P.-L. Bacon, and R. Goroshin, “Block-state transformers,” Advances in Neural Information Processing Systems, vol. 36, 2024

[74]

Pretraining without attention

J. Wang, J. N. Yan, A. Gu, and A. M. Rush, “Pretraining without attention,” arXiv preprint arXiv:2212.10544, 2022

[75]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023

[76]

Can mamba learn how to learn? a comparative study on in-context learning tasks

J. Park, J. Park, Z. Xiong, N. Lee, J. Cho, S. Oymak, K. Lee, and D. Papailiopoulos, “Can mamba learn how to learn? a comparative study on in-context learning tasks,” arXiv preprint arXiv:2402.04248, 2024

[77]

Fast Transformer Decoding: One Write-Head is All You Need

N. Shazeer, “Fast transformer decoding: One write-head is all you need,” arXiv preprint arXiv:1911.02150, 2019

[78]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai, “Gqa: Training generalized multi-query transformer models from multi-head checkpoints,” arXiv preprint arXiv:2305.13245, 2023

[79]

Linformer: Self-Attention with Linear Complexity

S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear complexity,” arXiv preprint arXiv:2006.04768, 2020

[80]

Lightweight and efficient end-to-end speech recognition using low-rank transformer

G. I. Winata, S. Cahyawijaya, Z. Lin, Z. Liu, and P. Fung, “Lightweight and efficient end-to-end speech recognition using low-rank transformer,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6144–6148

[81]

Flurka: Fast fused low-rank & kernel attention

A. Gupta, Y. Yuan, Y. Zhou, and C. Mendis, “Flurka: Fast fused low-rank & kernel attention,” arXiv preprint arXiv:2306.15799, 2023

