arxiv: 2505.07833 · v2 · pith:YIVXF3SSnew · submitted 2025-05-01 · 💻 cs.DC · cs.AI · cs.MA · cs.OS

Harmonia: End-to-End RAG Serving Optimization

classification 💻 cs.DC cs.AI cs.MA cs.OS
keywords harmonia, serving, components, end-to-end, inference, violations, across, addresses

Retrieval-Augmented Generation (RAG) improves the reliability of large language models by integrating external knowledge, but serving RAG pipelines efficiently is challenging because requests traverse heterogeneous components spanning LLM inference, databases, and CPU-side processing. We present Harmonia, an end-to-end RAG serving framework that addresses these bottlenecks through (i) a flexible pipeline specification interface for composing custom workflows, (ii) heterogeneity-aware deployment that provisions and configures components as a distributed inference system, and (iii) a closed-loop runtime controller that monitors load and execution progress and reduces SLO violations through request prioritization and auto-scaling. Across four RAG applications, Harmonia outperforms commercial alternatives, improving throughput by more than 2.04x while reducing SLO violations by up to 78.4 percent.
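To make the closed-loop controller idea concrete, here is a minimal sketch of one way request prioritization and auto-scaling could work. It is not Harmonia's implementation or API: the least-slack-first ordering, the one-replica scaling step, the function names `slack`, `prioritize`, and `desired_replicas`, the request fields, and the 5 percent violation target are all assumptions made for illustration.

```python
import heapq


def slack(now, deadline, remaining_estimate):
    """Time to spare before the SLO deadline, given an estimate of remaining work."""
    return deadline - now - remaining_estimate


def prioritize(requests, now):
    """Return request ids ordered least-slack-first (most urgent request first).

    Each request is a dict with hypothetical keys: "id", "deadline", "remaining".
    """
    heap = [(slack(now, r["deadline"], r["remaining"]), r["id"]) for r in requests]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


def desired_replicas(current, violation_rate, target=0.05, max_replicas=8):
    """Pick a replica count for one component from its observed SLO violation rate.

    Assumed policy: add one replica when violations exceed the target,
    remove one when they fall below half the target, otherwise hold.
    """
    if violation_rate > target:
        return min(current + 1, max_replicas)
    if violation_rate < target / 2 and current > 1:
        return current - 1
    return current


if __name__ == "__main__":
    pending = [
        {"id": "a", "deadline": 10.0, "remaining": 2.0},
        {"id": "b", "deadline": 6.0, "remaining": 3.0},
        {"id": "c", "deadline": 5.0, "remaining": 0.5},
    ]
    print(prioritize(pending, now=1.0))
    print(desired_replicas(current=2, violation_rate=0.2))
```

In this sketch the controller would call `prioritize` each time a component's queue changes and `desired_replicas` on each monitoring interval. A production controller would also need per-component progress estimates and scaling cooldowns, which are omitted here.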
