Llm in a flash: Efficient large language model inference with limited memory

Keivan Alizadeh, Seyed Iman Mirzadeh, Dmitry Belenko, S Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar · 2024

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

browse 2 citing papers

representative citing papers

Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows

cs.DC · 2026-03-12 · unverdicted · novelty 7.0

This work delivers the first measurements of performance-energy trade-offs across four multi-request LLM workflow patterns on A100 GPUs using vLLM and Parrot.

Beyond Scaling: Agents Are Heading to the Edge

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

Personal agents require edge deployment to preserve high-fidelity local context and zero-latency loops, as claimed through three structural shifts away from cloud-centric designs.

citing papers explorer

Showing 2 of 2 citing papers.

Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows cs.DC · 2026-03-12 · unverdicted · none · ref 7
This work delivers the first measurements of performance-energy trade-offs across four multi-request LLM workflow patterns on A100 GPUs using vLLM and Parrot.
Beyond Scaling: Agents Are Heading to the Edge cs.LG · 2026-05-18 · unverdicted · none · ref 2
Personal agents require edge deployment to preserve high-fidelity local context and zero-latency loops, as claimed through three structural shifts away from cloud-centric designs.

Llm in a flash: Efficient large language model inference with limited memory

fields

years

verdicts

representative citing papers

citing papers explorer