arxiv: 2107.04140 · v3 · pith:FYGZ5IGGnew · submitted 2021-07-08 · 💻 cs.AR

First-Generation Inference Accelerator Deployment at Facebook

Michael Anderson, Benny Chen, Stephen Chen, Summer Deng, Jordan Fix, Michael Gschwind, Aravind Kalaiah, Changkyu Kim, and 106 more authors
keywords: platform, accelerator, inference, facebook, deployment, accelerators, compute, describe

In this paper, we provide a deep dive into the deployment of inference accelerators at Facebook. Many of our ML workloads have unique characteristics, such as sparse memory accesses, large model sizes, and high compute, memory, and network bandwidth requirements. We co-designed a high-performance, energy-efficient inference accelerator platform based on these requirements. We describe the inference accelerator platform ecosystem we developed and deployed at Facebook: both hardware, through the Open Compute Project (OCP), and software framework and tooling, through PyTorch/Caffe2/Glow. From the start, this ecosystem has been open, so that it can support a variety of AI accelerators from different vendors. This platform, with six low-power accelerator cards alongside a single-socket host CPU, allows us to serve models of high complexity that cannot be easily or efficiently run on CPUs. We describe various performance optimizations, at both the platform and accelerator level, which enable this platform to serve production traffic at Facebook. We also share deployment challenges and lessons learned during performance optimization, and provide guidance for future inference hardware co-design.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FP8 Formats for Deep Learning

    cs.LG 2022-09 unverdicted novelty 6.0

    FP8 formats E4M3 and E5M2 match 16-bit training accuracy on CNNs, RNNs, and Transformers up to 175B parameters without hyperparameter changes.