TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
Learning to compress prompts with gist tokens
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
polarities
background 3representative citing papers
Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention heads, yielding substantial memory savings with negligible quality loss.
H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.
Meta-Soft dynamically synthesizes targeted soft tokens from a learnable meta-library using Gumbel-Softmax and applies attention-flow integration to compress KV cache while attempting to preserve evicted context information.
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
citing papers explorer
-
On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation
Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
-
A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.