Learning to compress prompts with gist tokens

[Muet al · 2023 · arXiv 2304.08467

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 2 method 1

citation-polarity summary

background 3

representative citing papers

TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

cs.SE · 2026-05-04 · unverdicted · novelty 7.0

TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.

On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation

cs.SE · 2026-04-15 · unverdicted · novelty 6.0

Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

cs.CL · 2023-10-03 · conditional · novelty 6.0

FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention heads, yielding substantial memory savings with negligible quality loss.

H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

cs.LG · 2023-06-24 · unverdicted · novelty 6.0

H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

cs.AI · 2026-05-21 · unverdicted · novelty 4.0 · 2 refs

Meta-Soft dynamically synthesizes targeted soft tokens from a learnable meta-library using Gumbel-Softmax and applies attention-flow integration to compress KV cache while attempting to preserve evicted context information.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

citing papers explorer

Showing 2 of 2 citing papers after filters.

On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation cs.SE · 2026-04-15 · unverdicted · none · ref 25
Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
A Survey on Efficient Inference for Large Language Models cs.CL · 2024-04-22 · accept · none · ref 34
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

Learning to compress prompts with gist tokens

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer