InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference

4 Pith papers cite this work.

Fields: cs.DC (2), cs.LG (2)

Years: 2026 (4)

representative citing papers

MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

MCAP uses load-time Monte Carlo profiling to estimate layer importance, which enables dynamic quantization (W4A8 vs. W4A16) and memory tiering (GPU/RAM/SSD). It delivers 1.5-1.8x higher decode throughput than llama-cpp Q4_0 on an NVIDIA T4 while fitting models into memory budgets that were previously infeasible.