Coded Computation over Heterogeneous Clusters

21 Jan 2017  ·  Amirhossein Reisizadeh, Saurav Prakash, Ramtin Pedarsani, Amir Salman Avestimehr ·

In large-scale distributed computing clusters, such as Amazon EC2, there are several types of "system noise" that can result in major degradation of performance: bottlenecks due to limited communication bandwidth, latency due to straggler nodes, etc. On the other hand, these systems enjoy abundance of redundancy - a vast number of computing nodes and large storage capacity. There have been recent results that demonstrate the impact of coding for efficient utilization of computation and storage redundancy to alleviate the effect of stragglers and communication bottlenecks in homogeneous clusters. In this paper, we focus on general heterogeneous distributed computing clusters consisting of a variety of computing machines with different capabilities. We propose a coding framework for speeding up distributed computing in heterogeneous clusters by trading redundancy for reducing the latency of computation. In particular, we propose Heterogeneous Coded Matrix Multiplication (HCMM) algorithm for performing distributed matrix multiplication over heterogeneous clusters that is provably asymptotically optimal for a broad class of processing time distributions. Moreover, we show that HCMM is unboundedly faster than any uncoded scheme. To demonstrate practicality of HCMM, we carry out experiments over Amazon EC2 clusters where HCMM is found to be up to $61\%$, $46\%$ and $36\%$ respectively faster than three benchmark load allocation schemes - Uniform Uncoded, Load-balanced Uncoded, and Uniform Coded. Additionally, we provide a generalization to the problem of optimal load allocation in heterogeneous settings, where we take into account the monetary costs associated with the clusters. We argue that HCMM is asymptotically optimal for budget-constrained scenarios as well, and we develop a heuristic algorithm for (HCMM) load allocation for budget-limited computation tasks.

PDF Abstract


Distributed, Parallel, and Cluster Computing Information Theory Information Theory


  Add Datasets introduced or used in this paper