A comparative analysis of state-of-the-art SQL-on-Hadoop systems for interactive analytics

31 Mar 2018 · Ashish Tapdiya, Daniel Fabbri

Hadoop is emerging as the primary data hub in enterprises, and SQL represents the de facto language for data analysis. This combination has led to the development of a variety of SQL-on-Hadoop systems in use today. While the various SQL-on-Hadoop systems target the same class of analytical workloads, their different architectures, design decisions and implementations impact query performance. In this work, we perform a comparative analysis of four state-of-the-art SQL-on-Hadoop systems (Impala, Drill, Spark SQL and Phoenix) using the Web Data Analytics micro benchmark and the TPC-H benchmark on the Amazon EC2 cloud platform. The TPC-H experiment results show that, although Impala outperforms the other systems (4.41x - 6.65x) in the text format, trade-offs exist in the Parquet format, with each system performing best on subsets of queries. A comprehensive analysis of execution profiles expands upon the performance results to provide insights into performance variations, performance bottlenecks and query execution characteristics.

Code

meacial/e-book
