AIBrix: Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure

The AIBrix Team et al. · 2025 · arXiv 2504.03648

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: method (1)

citation-polarity summary: use method (1)

representative citing papers

Maestro: Workload-Aware Cross-Cluster Scheduling for LLM-Based Multi-Agent Systems

cs.DC · 2026-06-11 · unverdicted · novelty 6.0

Maestro is a workload-aware scheduler for LLM-based multi-agent systems that cuts KV-reservation HBM by 67.2% and raises high-contention SLO attainment by 23.6 points over EDF via prediction-driven hierarchical scheduling.

Scepsy: Serving Agentic Workflows Using Aggregate LLM Pipelines

cs.DC · 2026-04-16 · unverdicted · novelty 6.0

Scepsy schedules arbitrary multi-LLM agentic workflows on GPU clusters by constructing Aggregate LLM Pipelines from stable per-LLM execution time shares, then searching fractional GPU allocations, tensor parallelism, and replica counts to achieve up to 2.4x higher throughput and 27x lower latency.

CascadeInfer: Length-Aware Scheduling of LLM Serving with Low Latency and Load Balancing

cs.DC · 2025-12-22 · conditional · novelty 6.0

CascadeInfer partitions LLM instances into length-specialized groups, uses dynamic programming for stage partitioning, and applies runtime refinement plus decentralized load balancing to cut latency and raise throughput.

RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching

cs.DC · 2026-05-08 · unverdicted · novelty 5.0

RcLLM accelerates generative recommendation inference by 1.31x-9.51x in TTFT through beyond-prefix KV caching, replicated user caches, sharded item caches, affinity scheduling, and selective attention with negligible accuracy loss.

Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

cs.LG · 2025-09-25 · unverdicted · novelty 5.0

Learning-augmented LRU achieves 1-consistency and O(k)-robustness for GPU caching with low overhead, implemented in LCR to cut P99 TTFT by up to 28.3% on LLM workloads and raise throughput by up to 24.2% on DLRM workloads.

GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources

cs.DC · 2026-05-16 · unverdicted · novelty 4.0

GoodServe proposes a predict-and-rectify routing system for agentic LLM inferences on heterogeneous GPUs that improves goodput by up to 27.4%.
