LLM in a flash: Efficient large language model inference with limited memory

· 2024

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles

cs.DC · 2026-05-19 · unverdicted · novelty 5.0

Reasoning workloads shift LLM inference to a capacity-bound regime where KV-cache fragmentation limits data parallelism, tensor parallelism unlocks memory at the 32B scale, and MoE models require hybrid strategies to avoid routing latency.

LLM-Powered AI Agent Systems and Their Applications in Industry

cs.AI · 2025-05-22 · unverdicted · novelty 2.0

A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.

citing papers explorer

Showing 2 of 2 citing papers.

Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles

cs.DC · 2026-05-19 · unverdicted · none · ref 5

Reasoning workloads shift LLM inference to a capacity-bound regime where KV-cache fragmentation limits data parallelism, tensor parallelism unlocks memory at the 32B scale, and MoE models require hybrid strategies to avoid routing latency.

LLM-Powered AI Agent Systems and Their Applications in Industry

cs.AI · 2025-05-22 · unverdicted · none · ref 97

A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.
