Smith, Gokul Subramanian Ravi, Jonathan M
Three Pith papers cite this work. Polarity classification is still being indexed.
Citation roles: background (2)
Citation polarities: background (2)

Citing papers:
NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference
NVLLM offloads FFN computations to integrated 3D NAND flash with page-level access and keeps attention in DRAM, delivering 16.7x-37.9x speedups over GPU out-of-core baselines for models up to 30B parameters.

Agentic Architect: An Agentic AI Framework for Architecture Design Exploration and Optimization
An LLM-driven agentic system evolves microarchitectural policies for cache replacement, data prefetching, and branch prediction, producing designs that match or exceed prior state-of-the-art in IPC on standard benchmarks.

Architecting Distributed Quantum Computers: Design Insights from Resource Estimation
A resource estimation framework for distributed fault-tolerant quantum computers based on lattice surgery identifies feasible hardware configurations for eight applications across thousands of setups, showing that architecture design must be guided by resource analysis for scalability.