The Truth about SQL on Hadoop (part 1)

This is a multi-part blog post meant to be an exhaustive introduction to SQL-on-Hadoop. The first part in this series will cover Storage Engines and Online Transaction Processing (OLTP). The next post will cover Online Analytical Processing (OLAP) while the third in the series will cover engine retrofits for Hadoop and choosing among the alternatives.

SQL on Hadoop is both an exciting and confusing topic. Nearly every week a new SQL on Hadoop project seems to grab the community's attention, if only for a fleeting moment. In this series we’ll discuss each of the top SQL on Hadoop solutions and have an honest conversation about their architecture, use cases, and more.

There are several broad areas of SQL on Hadoop engines:

  • Data processing and Online Analytical Processing (OLAP)
  • Retrofits
  • Online Transactional Processing (OLTP)

Storage Engines

There are three main storage engines today: Apache HBase, Apache Hadoop HDFS, and Apache Accumulo. Apache Accumulo is similar to HBase but was created by the National Security Agency (NSA) and historically had a strong focus on security, specifically on authorization. In our opinion HBase has now caught up in terms of security features and as such Accumulo won’t be discussed further.

HBase is a distributed key-value store organized as a sorted map; that is, data is sorted by key. As we will describe below, HBase is typically used for OLTP applications. HDFS is a distributed file system that can store extremely large datasets.
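To make the sorted-map model concrete, here is a minimal sketch of a key-range scan using the HBase 2.x Java client. The table name, row-key scheme, and key range are assumptions for illustration only; because rows are physically sorted by key, a scan bounded by a start and stop row only touches the relevant slice of the table.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SortedKeyScan {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events"))) {   // hypothetical table
          // Rows are stored sorted by key, so a scan over a key range
          // ("user#1000" up to, but not including, "user#2000") is efficient.
          Scan scan = new Scan()
              .withStartRow(Bytes.toBytes("user#1000"))
              .withStopRow(Bytes.toBytes("user#2000"));
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result row : scanner) {
              System.out.println(Bytes.toString(row.getRow()));
            }
          }
        }
      }
    }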

HBase stores data in HDFS in a format known as an HFile, which is not configurable. When using HDFS directly, without HBase, users must choose a file format.

When choosing a file format there are many considerations such as:

  • What will be the dominant reading pattern, reading all rows or reading a subset of rows?
  • Is it possible the data will contain any non-ASCII text?
  • Which tools will read and write this data (Apache Hive, Spark, etc)?

Broadly speaking, there are two kinds of file formats used in conjunction with HDFS: columnar and row-wise. Columnar formats such as RCFile, ORC, and Apache Parquet provide extremely efficient compression (by way of encodings like run-length encoding) and high read performance when only a subset of columns is read, for example when a table has 50 or 100 columns and a query reads just 7 or 8 of them. Row-wise formats such as delimited text, Sequence Files, and Apache Avro do not compress as efficiently but are better suited to scenarios where jobs will read most of the columns in a table or where data is streaming into a table in small batches.
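The columnar advantage shows up most clearly when an engine can prune columns at read time. Below is a small sketch using the Spark Java API; the HDFS paths and column names are assumptions, and reading the Avro copy requires the spark-avro module on the classpath.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ColumnPruning {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("column-pruning")
            .master("local[*]")   // local master for testing only
            .getOrCreate();

        // With a columnar format like Parquet, selecting a handful of columns
        // means only those columns are read from disk (column pruning).
        Dataset<Row> events = spark.read().parquet("hdfs:///data/events");   // hypothetical path
        events.select("user_id", "event_type", "ts").show();

        // The same query over a row-wise format (e.g. Avro) still has to read
        // every column of every row before discarding the rest.
        Dataset<Row> raw = spark.read().format("avro").load("hdfs:///data/events_avro");
        raw.select("user_id", "event_type", "ts").show();

        spark.stop();
      }
    }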

We recommend ruling out Text, RCFile, and Sequence Files as they are legacy formats and should not be used, with the potential exception of legacy systems integration. We don’t recommend these formats because they are easily corrupted (for example by non-ASCII text), perform poorly, and, outside of text, can be read by few tools.

Once you’ve answered the columnar or row-wise question and ruled out legacy formats, the most important question is which tools and engines will be reading and writing this data. The Avro and Parquet projects have been integrated into the largest number of Hadoop ecosystem tools and engines, while ORC is the highest performing Apache Hive file format.
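One reason Avro travels so well across tools is that every Avro data file embeds its own schema. The sketch below writes a single record with the Avro Java API; the record layout, field names, and output file are assumptions for illustration.

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import java.io.File;

    public class AvroWriteExample {
      public static void main(String[] args) throws Exception {
        // Schema for a simple event record (field names are hypothetical)
        Schema schema = SchemaBuilder.record("Event").fields()
            .requiredString("userId")
            .requiredLong("ts")
            .endRecord();

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("userId", "user#1001");
        rec.put("ts", 1700000000000L);

        // The data file carries the schema with it, which is part of why
        // so many ecosystem tools can read Avro without extra metadata.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
          writer.create(schema, new File("events.avro"));
          writer.append(rec);
        }
      }
    }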

Online Transactional Processing (OLTP)

The Apache HBase project provides OLTP-type performance with extreme scalability. HBase is the only Hadoop component which is commonly used for online (customer facing) data storage. However, HBase does not aim to be a Relational Database Management System (RDBMS) and thus it’s not a drop-in replacement for MySQL, Oracle, or DB2. In fact, HBase itself does not provide any SQL interface; users must code in Java, Python, or Ruby to store and retrieve data.
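For a sense of what that code looks like, here is a minimal sketch of a point write and read with the HBase Java client. The table name ("users"), column family ("d"), and row-key scheme are assumptions for illustration, not a recommended schema.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientExample {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {   // hypothetical table
          // Write a row keyed by user id; columns live in a column family ("d" here)
          Put put = new Put(Bytes.toBytes("user#1001"));
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("email"), Bytes.toBytes("a@example.com"));
          table.put(put);

          // Read it back by key; point lookups like this are HBase's sweet spot
          Result result = table.get(new Get(Bytes.toBytes("user#1001")));
          System.out.println(Bytes.toString(
              result.getValue(Bytes.toBytes("d"), Bytes.toBytes("email"))));
        }
      }
    }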

Apache Phoenix is a project which aims to provide OLTP-style SQL on top of Apache HBase. Phoenix allows users to perform inserts, updates, and deletes on top of the HBase data model. However, as mentioned earlier, the HBase data model is fundamentally different from a traditional RDBMS, and as such HBase plus Phoenix is still not a drop-in replacement for an RDBMS.
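Phoenix exposes HBase through a standard JDBC driver, so application code looks like plain JDBC. Below is a minimal sketch, assuming the Phoenix client jar is on the classpath and that "zk-host" stands in for your ZooKeeper quorum; the table and columns are hypothetical. Note that Phoenix uses UPSERT rather than separate INSERT and UPDATE statements.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class PhoenixExample {
      public static void main(String[] args) throws Exception {
        // The connection string names the ZooKeeper quorum (host is an assumption)
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
          conn.createStatement().execute(
              "CREATE TABLE IF NOT EXISTS users (id BIGINT NOT NULL PRIMARY KEY, email VARCHAR)");

          // UPSERT inserts the row if it is new and overwrites it otherwise
          try (PreparedStatement ps =
                   conn.prepareStatement("UPSERT INTO users (id, email) VALUES (?, ?)")) {
            ps.setLong(1, 1001L);
            ps.setString(2, "a@example.com");
            ps.executeUpdate();
          }
          conn.commit();   // Phoenix connections do not auto-commit by default

          try (ResultSet rs = conn.createStatement()
                   .executeQuery("SELECT email FROM users WHERE id = 1001")) {
            while (rs.next()) {
              System.out.println(rs.getString(1));
            }
          }
        }
      }
    }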

HBase (and Phoenix) are useful for use cases where you are having trouble scaling an application on top of an RDBMS. A traditional solution to a scaling problem in the RDBMS world is to horizontally partition the database. While this solution works, it suffers from some drawbacks:

  1. Transactions across partitions are not supported
  2. Adding resources requires a complex and expensive re-partition

Like a partitioned database, HBase does not support transactions; however, adding resources and re-balancing workloads in HBase is much easier. New nodes can be added to an HBase cluster and HBase will automatically assign partitions to them. In other words, both partitioned databases and HBase lack transactions, but HBase makes it easy to add additional resources. Several companies have built on the underlying HBase infrastructure while adding SQL transactions, e.g. Splice Machine.