pith. sign in

hub

Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

hub tools

citation-role summary

background 3 method 1

citation-polarity summary

clear filters

representative citing papers

KernelFlume: Elastic Core-Attention Scaling for Agentic Long-Context Decoding

cs.DC · 2026-06-28 · unverdicted · novelty 5.0

KernelFlume presents a disaggregated decode architecture that separates core attention from projection/FFN paths to enable elastic scaling of attention nodes, reporting up to 61% lower cost per million tokens versus full-instance scaling on H100 hardware for Llama-3.1-8B under dynamic long-context w

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

citing papers explorer

Showing 1 of 1 citing paper after filters.