Distributed Clustering via LSH Based Data Partitioning

ICML 2018 · Aditya Bhaskara, Maheshakya Wijewardena ·

Given the importance of clustering in the analysisof large scale data, distributed algorithms for formulations such as k-means, k-median, etc. have been extensively studied. A successful approach here has been the “reduce and merge” paradigm, in which each machine reduces its input size to {Õ}(k), and this data reduction continues (possibly iteratively) until all the data fits on one machine, at which point the problem is solved locally. This approach has the intrinsic bottleneck that each machine must solve a problem of size $\geq$ k, and needs to communicate at least $\Omega$(k) points to the other machines. We propose a novel data partitioning idea to overcome this bottleneck, and in effect, have different machines focus on “finding different clusters”. Under the assumption that we know the optimum value of the objective up to a poly(n) factor (arbitrary polynomial), we establish worst-case approximation guarantees for our method. We see that our algorithm results in lower communication as well as a near-optimal number of ‘rounds’ of computation (in the popular MapReduce framework).

PDF Abstract

Code

Add Remove Mark official

No code implementations yet. Submit your code now

Tasks

Add Remove

Clustering

Datasets

FMA

Results from the Paper

Add Remove

Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods

Add Remove

No methods listed for this paper. Add relevant methods here

Edit Social Preview

Distributed Clustering via LSH Based Data Partitioning

Code Edit Add Remove Mark official

Tasks Edit Add Remove

Datasets Edit

Results from the Paper Edit Add Remove

Methods Edit Add Remove

Code

Add Remove Mark official

Tasks

Add Remove

Datasets

Results from the Paper

Add Remove

Methods

Add Remove