TensorFlow-Serving: Flexible, High-Performance ML Serving
We describe TensorFlow-Serving, a system for serving machine learning models inside Google that is also available in the cloud and as open source. It is extremely flexible in the types of ML platforms it supports and in the ways it integrates with systems that convey new models and updated versions from training to serving. At the same time, the core code paths around model lookup and inference have been carefully optimized to avoid performance pitfalls observed in naive implementations. Google uses it in many production deployments, including a multi-tenant model hosting service called TFS^2.
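The abstract's central mechanism, fast model lookup combined with non-disruptive version transitions, can be sketched in a few lines. This is a hypothetical illustration of the concept only; `ServableMap` and its methods are assumptions for this sketch, not TensorFlow-Serving's actual API.

```python
import threading

class ServableMap:
    """Sketch of version-aware model lookup: clients request a model by
    name and receive the latest loaded version, while a loader path can
    swap new versions in and old versions out behind the same name."""

    def __init__(self):
        self._lock = threading.Lock()
        self._versions = {}  # model name -> {version number: model object}

    def load(self, name, version, model):
        # Called when the training pipeline delivers a new model version.
        with self._lock:
            self._versions.setdefault(name, {})[version] = model

    def unload(self, name, version):
        # Retire an old version once it is no longer needed.
        with self._lock:
            self._versions.get(name, {}).pop(version, None)

    def lookup(self, name, version=None):
        # Hot path: return the requested (or latest) version of a model.
        with self._lock:
            versions = self._versions.get(name)
            if not versions:
                raise KeyError(f"no servable named {name!r}")
            if version is None:
                version = max(versions)  # pick the newest version
            return versions[version]

# Usage: version 2 is picked up for new requests as soon as it loads,
# while version 1 remains addressable until it is unloaded.
servables = ServableMap()
servables.load("classifier", 1, lambda x: x * 2)
servables.load("classifier", 2, lambda x: x * 3)
print(servables.lookup("classifier")(10))     # newest version -> 30
print(servables.lookup("classifier", 1)(10))  # pinned old version -> 20
```

The real system avoids even this per-request lock on its hot path; the sketch only shows the name-to-version indirection that makes model updates invisible to clients.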
Forward citations
Cited by 3 Pith papers
- Efficient Memory Management for Large Language Model Serving with PagedAttention
  PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
- ERPPO: Entropy Regularization-based Proximal Policy Optimization
  ERPPO adds a DSA-based ambiguity estimator to MAPPO and switches between L1 and L2 entropy regularization to improve exploration and stability in non-stationary multi-dimensional observations.
- EdgeServing: Deadline-Aware Multi-DNN Serving at the Edge
  EdgeServing schedules multi-DNN inference on edge GPUs via time-division sharing and early exits, using a stability score to minimize system-wide SLO violations and P95 latency.