LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression
In long context scenarios, large language models (LLMs) face three main challenges: higher computational cost, performance reduction, and position bias. Research indicates that LLM performance hinges on the density and position of key information in the input prompt. Inspired by these findings, we propose LongLLMLingua for prompt compression towards improving LLMs' perception of the key information to simultaneously address the three challenges. Our extensive evaluation across various long context scenarios demonstrates that LongLLMLingua not only enhances performance but also significantly reduces costs and latency. For instance, in the NaturalQuestions benchmark, LongLLMLingua boosts performance by up to 21.4% with around 4x fewer tokens in GPT-3.5-Turbo, leading to substantial cost savings. It achieves a 94.0% cost reduction in the LooGLE benchmark. Moreover, when compressing prompts of about 10k tokens at ratios of 2x-6x, LongLLMLingua can accelerate end-to-end latency by 1.4x-2.6x. Our code is available at https://aka.ms/LongLLMLingua.
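The compression ratios quoted in the abstract can be turned into token counts with simple arithmetic. The sketch below is illustrative only and is not taken from the LongLLMLingua code: the helper names `compressed_tokens` and `token_reduction` are hypothetical, and it assumes an "Nx" ratio means the compressed prompt has 1/N of the original tokens. It does not model latency or the reported benchmark cost figures, which depend on factors beyond prompt length.

```python
def compressed_tokens(original_tokens: int, ratio: float) -> int:
    """Approximate prompt length after compression.

    Assumption: a ratio of 4.0 ("4x") means the compressed prompt keeps
    1/4 of the original tokens.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return round(original_tokens / ratio)


def token_reduction(ratio: float) -> float:
    """Fraction of prompt tokens removed at the given compression ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 - 1.0 / ratio


if __name__ == "__main__":
    # The abstract mentions prompts of about 10k tokens at 2x-6x ratios.
    for ratio in (2, 4, 6):
        kept = compressed_tokens(10_000, ratio)
        saved = token_reduction(ratio)
        print(f"{ratio}x: about {kept} tokens kept, {saved:.0%} of prompt tokens removed")
```

Under this assumption, a 10k-token prompt compressed at 4x keeps about 2,500 tokens, so any cost that scales linearly with prompt tokens would fall by the same 75%.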
Forward citations
Cited by 23 Pith papers
-
RULER: What's the Real Context Size of Your Long-Context Language Models?
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
-
Tool-Schema Compression Enables Agentic RAG Under Constrained Context Budgets
Tool schema compression by 44-50% enables agentic RAG at 8K context where uncompressed schemas fail, with +20.5 pp exact match lift across models and scaling to over 800 tools.
-
TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments
TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
-
TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management
TokenMizer builds a knowledge graph of LLM sessions and serializes it into 78-token resume blocks that retain more task, decision, and file information than flat-text baselines at roughly half the token cost.
-
Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?
GroundedCache reduces unsafe-served rate in RAG answer caching to 0-1.5% (vs 15-51.5% naive) via four validation gates while keeping p50 latency within 1.07x of no-cache baseline.
-
SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors
SemanticZip is a pilot framework introducing LLM-mediated lossy text compression with an experimental interface evaluating six representation regimes on five diagnostic cases for semantic atom recovery and token efficiency.
-
Mapping Text to Multiplex Graph: Prompt Compression as Lévy Walk-Guided Graph Pruning
RAGP models prompt compression as redundancy-aware pruning on a multiplex graph using Lévy walks, achieving 49.3 average on LongBench at 4x compression versus 48.8 for LongLLMLingua at 3x.
-
On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation
Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
-
Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations
Cooperative paging replaces evicted LLM context with keyword bookmarks and adds a recall tool, outperforming six other methods on the LoCoMo benchmark across four models with statistical significance.
-
Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models
A unified compressed-sensing framework enables dynamic, task- and token-adaptive structured reduction of LLMs with formal sample-complexity bounds.
-
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
R1-Searcher uses two-stage outcome-based RL to train LLMs to invoke external search systems for better reasoning without process rewards or distillation.
-
LIFT: A Novel Framework for Enhancing Long-Context Understanding of LLMs via Long Input Fine-Tuning
LIFT fine-tunes short-context LLMs on long inputs with synthetic tasks to absorb information into parameters, enabling answers without the input present at inference.
-
CODEPROMPTZIP: Code-specific Prompt Compression for Retrieval-Augmented Generation in Coding Tasks with LMs
CodePromptZip builds a code compressor via type-aware ablation-ranked training samples and a copy-augmented small LM, reporting 23.4-28.7% gains over baselines on three RAG coding tasks.
-
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...
-
Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents
CICL scores and compresses context evidence for LLM agents via action-shift and outcome-uplift metrics, lifting hit@1 from 0.58 to 0.78 on 50 SWE-bench retrieval tasks.
-
Talk Less, Fly Lighter: Autonomous Semantic Compression for UAV Swarm Communication via LLMs
LLM-based autonomous semantic compression in four 2D UAV swarm simulations shows potential for efficient collaborative communication under bandwidth constraints.
-
E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning
E2LLM uses encoder-based soft prompt compression for long contexts to improve LLM reasoning on tasks like summarization and QA while maintaining efficiency.
-
AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models
AdaComp trains a compression-rate predictor on annotated minimum top-k data to adaptively retain only the documents needed for each RAG query.
-
Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.
-
Supplement Generation Training for Enhancing Agentic Task Performance
SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.
-
Token-Operations-Oriented Inference Optimization Techniques for Large Models
The paper introduces a four-layer technical architecture for token-operations-oriented inference optimization in large models and reviews key technologies and industry status at each layer.
-
A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
-
Retrieval-Augmented Generation for Large Language Models: A Survey
A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.