LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
Mixed citation behavior. Most common role is background (67%).
citing papers explorer
- OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
- SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting for runtime conditions.
- On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation
Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
- Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models
A unified compressed-sensing framework enables dynamic, task- and token-adaptive structured reduction of LLMs with formal sample-complexity bounds.
- Learning to Configure Agentic AI Systems
ARC learns per-query agent configurations via a lightweight hierarchical SMDP policy, delivering 31.3% higher reasoning accuracy, 13.95% higher tool-use accuracy, and doubled success on an agent benchmark compared to budget-matched baselines.
- MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites its memory, extrapolating from an 8K training context to 3.5M-token QA with under 5% performance loss and scoring above 95% on 512K RULER.
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.
- Compressed Chain of Thought: Efficient Reasoning Through Dense Representations
CCoT generates variable-length continuous contemplation tokens that compress explicit reasoning chains, enabling additional dense reasoning and accuracy gains in off-the-shelf language models while allowing adaptive control of token count.
- Budget-Aware Routing for Long Clinical Text
RCD balances relevance, coverage, and diversity in a knapsack-constrained selection framework, with experiments showing that selector choice and budget level determine optimal unitization strategies on clinical datasets.
- ONTO: A Token-Efficient Columnar Notation for LLM Input Optimization
ONTO notation reduces LLM input tokens by 46-51% versus JSON on synthetic operational datasets while preserving task accuracy through a schema-once columnar design with hierarchical support.
- LLM-assisted Agentic Edge Intelligence Framework
The LEI framework uses a cloud LLM to dynamically create and update tailored lightweight programs for heterogeneous edge devices; on four sensor datasets it maintains low CPU and memory use while adapting to changes.
- E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning
E2LLM uses encoder-based soft prompt compression for long contexts to improve LLM reasoning on tasks like summarization and QA while maintaining efficiency.
- AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models
AdaComp trains a compression-rate predictor on annotated minimum top-k data to adaptively retain only the documents needed for each RAG query.
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs, covering a taxonomy, contributing factors, detection methods, benchmarks, mitigation strategies, and open research directions.
- Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
Byte-exact deduplication reduces RAG context size by 0.16% to 80.34% across three regimes with zero measurable quality regression per multi-vendor LLM evaluation.
- LLM-Oriented Information Retrieval: A Denoising-First Perspective
The paper argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.
- Supplement Generation Training for Enhancing Agentic Task Performance
SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.
- Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression