LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
read the original abstract
Large language models (LLMs) have been applied in various applications due to their astonishing capabilities. With advancements in technologies such as chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed to LLMs are becoming increasingly lengthy, even exceeding tens of thousands of tokens. To accelerate model inference and reduce cost, this paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity under high compression ratios, a token-level iterative compression algorithm to better model the interdependence between compressed contents, and an instruction tuning based method for distribution alignment between language models. We conduct experiments and analysis over four datasets from different scenarios, i.e., GSM8K, BBH, ShareGPT, and Arxiv-March23; showing that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss. Our code is available at https://aka.ms/LLMLingua.
This paper has not been read by Pith yet.
Forward citations
Cited by 32 Pith papers
-
StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns
StaminaBench evaluates coding agents over 100 procedurally generated change requests to a REST API, finding that tested models fail within 5-6 turns without feedback but improve up to 12x with test feedback and good h...
-
QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving
QCFuse achieves full-prefill quality in RAG with 1.7x average prefill speedup over full prefill and 1.5x over ProphetKV via compressed query-aware cache fusion.
-
Tool-Schema Compression Enables Agentic RAG Under Constrained Context Budgets
Tool schema compression by 44-50% enables agentic RAG at 8K context where uncompressed schemas fail, with +20.5 pp exact match lift across models and scaling to over 800 tools.
-
ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
-
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
-
End-to-End Context Compression at Scale
LCLMs are scaled 0.6B-encoder 4B-decoder compressors pre-trained on over 350B tokens that improve the Pareto frontier for general-task performance, compression speed, and peak memory in long-context language model inference.
-
HMARS: A Hierarchical Multi-Agent Memory System for Long-Context Reasoning
HMARS introduces a hierarchical multi-agent memory system that outperforms standard retrieval and other baselines on long-document and multi-turn reasoning tasks through improved evidence coverage.
-
Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories
Language models can use a two-stage sleep process of upward distillation for memory consolidation and RL-based dreaming for unsupervised self-improvement to enable continual learning.
-
Mapping Text to Multiplex Graph: Prompt Compression as L\'evy Walk-Guided Graph Pruning
RAGP models prompt compression as redundancy-aware pruning on a multiplex graph using Lévy walks, achieving 49.3 average on LongBench at 4x compression versus 48.8 for LongLLMLingua at 3x.
-
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting...
-
On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation
Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
-
Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models
A unified compressed-sensing framework enables dynamic, task- and token-adaptive structured reduction of LLMs with formal sample-complexity bounds.
-
Learning to Configure Agentic AI Systems
ARC learns per-query configurations for LLM agent systems via a lightweight hierarchical policy in an SMDP formulation, delivering 31% higher reasoning accuracy and doubled success on an agent benchmark over budget-ma...
-
Learning to Configure Agentic AI Systems
ARC learns per-query agent configurations via a lightweight hierarchical SMDP policy, delivering 31.3% higher reasoning accuracy, 13.95% higher tool-use accuracy, and doubled success on an agent benchmark compared to ...
-
Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression
Extra-CoT trains a semantic compressor on math CoT data, applies mixed-ratio SFT, and uses CHRPO reinforcement learning to achieve over 73% token reduction on MATH-500 with 0.6% accuracy gain on Qwen3-1.7B.
-
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
-
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.
-
Compressed Chain of Thought: Efficient Reasoning Through Dense Representations
CCoT generates variable-length continuous contemplation tokens that compress explicit reasoning chains, enabling additional dense reasoning and accuracy gains in off-the-shelf language models while allowing adaptive c...
-
Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents
CICL scores and compresses context evidence for LLM agents via action-shift and outcome-uplift metrics, lifting hit@1 from 0.58 to 0.78 on 50 SWE-bench retrieval tasks.
-
LLM-Oriented Information Retrieval: A Denoising-First Perspective
Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...
-
Budget-Aware Routing for Long Clinical Text
RCD balances relevance, coverage, and diversity in a knapsack-constrained selection framework, with experiments showing that selector choice and budget level determine optimal unitization strategies on clinical datasets.
-
ONTO: A Token-Efficient Columnar Notation for LLM Input Optimization
ONTO notation reduces LLM input tokens by 46-51% versus JSON on synthetic operational datasets while preserving task accuracy through a schema-once columnar design with hierarchical support.
-
LLM-assisted Agentic Edge Intelligence Framework
LEI framework uses a cloud LLM to dynamically create and update tailored lightweight programs for heterogeneous edge devices, shown on four sensor datasets to maintain low CPU and memory use while adapting to changes.
-
E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning
E2LLM uses encoder-based soft prompt compression for long contexts to improve LLM reasoning on tasks like summarization and QA while maintaining efficiency.
-
AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models
AdaComp trains a compression-rate predictor on annotated minimum top-k data to adaptively retain only the documents needed for each RAG query.
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
ContextSniper: AntTrail's Token-Efficient Code Memory for Repository-Level Program Repair
ContextSniper reduces token use by 38.9-51.5% in repository-level program repair agents on SWE-bench Lite with 2 percentage point drops in resolution rate.
-
Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents
On a hotel expense benchmark, pruning LLM agent context to the last 5 tool pairs plus summarization raises completion to 91.6% and cuts tokens by ~63% compared with retaining full conversation history.
-
Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.
-
LLM-Oriented Information Retrieval: A Denoising-First Perspective
Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.
-
Supplement Generation Training for Enhancing Agentic Task Performance
SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.
-
Token-Operations-Oriented Inference Optimization Techniques for Large Models
The paper introduces a four-layer technical architecture for token-operations-oriented inference optimization in large models and reviews key technologies and industry status at each layer.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.