pith. sign in

arxiv: 1811.01564 · v2 · pith:NM2FOENFnew · submitted 2018-11-05 · 💻 cs.LG · stat.ML

Parallel training of linear models without compromising convergence

classification 💻 cs.LG stat.ML
keywords algorithmconvergencetraininglineparallelasynchronouscachecompared
0
0 comments X
read the original abstract

In this paper we analyze, evaluate, and improve the performance of training generalized linear models on modern CPUs. We start with a state-of-the-art asynchronous parallel training algorithm, identify system-level performance bottlenecks, and apply optimizations that improve data parallelism, cache line locality, and cache line prefetching of the algorithm. These modifications reduce the per-epoch run-time significantly, but take a toll on algorithm convergence in terms of the required number of epochs. To alleviate these shortcomings of our systems-optimized version, we propose a novel, dynamic data partitioning scheme across threads which allows us to approach the convergence of the sequential version. The combined set of optimizations result in a consistent bottom line speedup in convergence of up to 12x compared to the initial asynchronous parallel training algorithm and up to 42x, compared to state of the art implementations (scikit-learn and h2o) on a range of multi-core CPU architectures.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.