
arxiv: 2509.19729 · v2 · submitted 2025-09-24 · 💻 cs.DC

Recognition: unknown

Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services

keywords context, requests, amoeba, inference, parallelism, services, throughput, degree

In Large Language Model (LLM) inference services, it is challenging to choose a parallelism configuration that efficiently serves requests with varying context lengths. Long-context requests need a high degree of parallelism to provide more memory for the Key-Value (KV) cache, while short-context requests favor a low degree of parallelism, which increases concurrency and thus throughput. To maintain high throughput while supporting long contexts on demand, we propose Amoeba, a runtime Tensor Parallel (TP) transformation for online LLM inference services that adaptively adjusts the TP degree of running instances to match the dynamics of incoming requests. Evaluations using real-world traces show that Amoeba improves throughput by 1.75x-6.57x compared to state-of-the-art solutions.
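
The trade-off the abstract describes can be made concrete with a back-of-the-envelope calculation. The sketch below is not Amoeba's algorithm; it is a minimal illustration, under simplifying assumptions, of why a higher TP degree leaves more room for KV cache while a lower one yields more concurrent instances. All names (`ModelSpec`, `kv_token_capacity`, `min_tp_for_context`) and the model and GPU sizes are hypothetical, and the estimate ignores activation memory, fragmentation, and communication overhead.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical model description; the values below are illustrative only."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int
    weight_bytes: int


def kv_bytes_per_token(m: ModelSpec) -> int:
    # One key vector and one value vector per layer and per KV head.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.dtype_bytes


def kv_token_capacity(m: ModelSpec, tp: int, gpu_mem_bytes: int) -> int:
    # Assumption: weights are sharded evenly over the tp GPUs of one instance,
    # so the instance's pooled memory minus the weights is available for KV cache.
    free = tp * gpu_mem_bytes - m.weight_bytes
    return max(free, 0) // kv_bytes_per_token(m)


def min_tp_for_context(m, context_tokens, gpu_mem_bytes, candidates=(1, 2, 4, 8)):
    # Smallest candidate TP degree whose KV capacity fits the request, else None.
    for tp in sorted(candidates):
        if kv_token_capacity(m, tp, gpu_mem_bytes) >= context_tokens:
            return tp
    return None


def num_instances(total_gpus: int, tp: int) -> int:
    # Lower TP degree means more independent instances on the same GPUs.
    return total_gpus // tp


# Illustrative numbers: an fp16 model with 16 GiB of weights on 24 GiB GPUs.
spec = ModelSpec(num_layers=32, num_kv_heads=8, head_dim=128,
                 dtype_bytes=2, weight_bytes=16 * GIB)
GPU_MEM = 24 * GIB

if __name__ == "__main__":
    for tp in (1, 2, 4, 8):
        print(tp, num_instances(8, tp), kv_token_capacity(spec, tp, GPU_MEM))
```

With these illustrative numbers, a short request fits at TP 1, where eight GPUs host eight instances, while a request of a few hundred thousand tokens needs a higher degree and therefore fewer instances. That tension is what motivates changing the TP degree at runtime rather than fixing it at deployment.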

This paper has not been read by Pith yet.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start

    cs.DC · 2026-04 · unverdicted · novelty 6.0

    Foundry uses template-based CUDA graph context materialization to reduce LLM serving cold-start latency by up to 99% while preserving CUDA graph throughput gains.