Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.DC (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Citing papers
- Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC
Blink enables CPU-free LLM inference via SmartNIC offload and persistent GPU kernel, delivering up to 8.47x lower P99 TTFT, 3.4x lower P99 TPOT, 2.1x higher decode throughput, and 48.6% lower energy per token while remaining stable under CPU interference.
- FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving
FaaSMoE treats MoE experts as on-demand FaaS functions with configurable granularity, using under one-third the resources of a full-model baseline under multi-tenant workloads.