Title resolution pending

Difan Tan, Jiawei Li, Hua Wang, Xiaoxiao Li, Wenbo Liu, Zijin Qin, Ke Zhou, Ming Xie, Mengling Tao · 2025 · arXiv 9940.370725

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics

cs.OS · 2026-05-18 · unverdicted · novelty 7.0

TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.

HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing

cs.PF · 2026-04-20 · unverdicted · novelty 7.0

HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.

KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving

cs.DC · 2026-04-17 · unverdicted · novelty 6.0

KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.

Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start

cs.DC · 2026-04-08 · unverdicted · novelty 6.0

Foundry uses template-based CUDA graph context materialization to reduce LLM serving cold-start latency by up to 99% while preserving CUDA graph throughput gains.

Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services

cs.DC · 2025-09-24 · unverdicted · novelty 6.0

Amoeba adaptively adjusts tensor parallelism at runtime for LLM inference services to handle mixed short and long context requests, delivering 1.75x-6.57x throughput gains over prior solutions in real-world trace evaluations.

citing papers explorer

Showing 5 of 5 citing papers.

TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics cs.OS · 2026-05-18 · unverdicted · none · ref 76
TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.
HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing cs.PF · 2026-04-20 · unverdicted · none · ref 36
HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.
KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving cs.DC · 2026-04-17 · unverdicted · none · ref 50
KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.
Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start cs.DC · 2026-04-08 · unverdicted · none · ref 38
Foundry uses template-based CUDA graph context materialization to reduce LLM serving cold-start latency by up to 99% while preserving CUDA graph throughput gains.
Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services cs.DC · 2025-09-24 · unverdicted · none · ref 23
Amoeba adaptively adjusts tensor parallelism at runtime for LLM inference services to handle mixed short and long context requests, delivering 1.75x-6.57x throughput gains over prior solutions in real-world trace evaluations.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer