Supporting Very Large Models using Automatic Dataflow Graph Partitioning

Chien-Chin Huang; Jinyang Li; Minjie Wang

arxiv: 1807.08887 · v2 · pith:JA3B6ZJRnew · submitted 2018-07-24 · 💻 cs.DC · cs.LG

Minjie Wang, Chien-Chin Huang, Jinyang Li

classification 💻 cs.DC cs.LG

keywords large models tofu very dataflow graph partition operator


This paper presents Tofu, a system that partitions very large DNN models across multiple GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow graph of fine-grained tensor operators in order to work transparently with a general-purpose deep learning platform like MXNet. In order to automatically partition each operator, we propose to describe the semantics of an operator in a simple language which represents tensors as lambda functions mapping from tensor coordinates to values. To optimally partition different operators in a dataflow graph, Tofu uses a recursive search algorithm that minimizes the total communication cost. Our experiments on an 8-GPU machine show that Tofu enables the training of very large CNN and RNN models. It also achieves 25% - 400% speedup over alternative approaches to train very large models.
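
The idea of describing a tensor as a function from coordinates to values, and then partitioning an operator along one dimension, can be illustrated with a minimal Python sketch. This is not Tofu's description language or its search algorithm; the helper names (`tensor`, `row_slice`, `partitioned_matmul`) are hypothetical, and the sketch assumes the row count divides evenly by the number of workers.

```python
def tensor(shape, fn):
    """Hypothetical representation: a shape plus a function from coordinates to a value."""
    return {"shape": shape, "at": fn}

def matmul(a, b):
    """C[i, j] = sum_k A[i, k] * B[k, j], written coordinate-wise."""
    n, k = a["shape"]
    k2, m = b["shape"]
    assert k == k2, "inner dimensions must match"
    return tensor((n, m), lambda i, j: sum(a["at"](i, x) * b["at"](x, j) for x in range(k)))

def row_slice(t, start, stop):
    """View of rows [start, stop) of t, expressed by shifting the row coordinate."""
    _, cols = t["shape"]
    return tensor((stop - start, cols), lambda i, j: t["at"](i + start, j))

def materialize(t):
    """Evaluate the coordinate function at every position into nested lists."""
    rows, cols = t["shape"]
    return [[t["at"](i, j) for j in range(cols)] for i in range(rows)]

def partitioned_matmul(a, b, parts):
    """Split the output rows across `parts` workers.

    Each worker needs only its row slice of A but all of B.
    Assumes the number of rows of A is divisible by `parts`.
    """
    n, _ = a["shape"]
    step = n // parts
    out = []
    for p in range(parts):
        shard = matmul(row_slice(a, p * step, (p + 1) * step), b)
        out.extend(materialize(shard))
    return out

A = tensor((4, 3), lambda i, j: i + j)
B = tensor((3, 2), lambda i, j: i * 2 + j)
full = materialize(matmul(A, B))
sharded = partitioned_matmul(A, B, parts=2)
```

In this sketch, splitting along the output rows keeps the shards independent but requires every worker to hold all of `B`. Splitting along a different dimension would change which inputs must be replicated or exchanged, which is the kind of trade-off a communication-minimizing search has to weigh.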
