Apache Hive and complementary technologies, such as Cloudera Impala, provide scalable SQL on Apache Hadoop. Unlike legacy database systems, Hive and Impala have traditionally not provided any update functionality. However, many use cases require periodically updating rows such as slowly changing dimension tables.
SQL on Hadoop technologies typically utilize one of two storage engines: Apache HBase or Apache Hadoop HDFS. Each storage engine excels at different use cases and have vastly different performance characteristics. Storage engines are covered in detail in our detailed SQL on Hadoop series.
Strategy #1: Use Apache HBase
HBase and Apache Accumulo provide the ability to perform updates and when update functionality is required, using HBase as a storage engine seems like a natural fit. This could be a disastrous decision due a fundamental impedance mismatch between the performance characteristics that most Hive use cases require and what HBase provides.
HBase provides a highly performant data store for random writes and random reads. However, HBase does not excel at large scans, such as full table scans or scanning tens of thousands of rows. This is in contrast with Hive, which either scans a full table or full set of partitions for each query. The result is that using Hive on HBase should be used conservatively. When a table is small, this integration can work well, but Hive on HBase will not perform well on large tables.
There are two caveats the guidelines above. Hive has recently introduced the ability to query HBase Snapshots which, as files on HDFS, will provide significantly improved large scan performance. Additionally, Hive will eliminate HBase partitions if the query contains row key as predicates. For example, assume the row keys on the table are 0001 through 9999 and the table is partitioned into two regions 0001-4999 and 5000-9999. If a query specifies the predicate rowKey > 5000, then only the second region will be scanned as part of the Hive query.
Strategy #2: Reload data from the source
Another option is reloading the entire dataset from the source. This approach works well in scenarios where relatively small dimension tables can be completely refreshed. phData recommends this approach, if possible, as it’s the simplest of the four strategies and is least likely to cause operational incidents. Of course, completely reloading large tables, such as fact tables, is often prohibitive.
Before proceeding, consider the size of the dataset being reloaded, the projected growth of the table, how often and when the dataset will be extracted, and the load on the system at that time. This is simply a long winded list of reasons to ask your friendly database administrators if this strategy is acceptable for a given dataset. A common dataset which would fit these requirements is 1GB, growing at 10% a year, extracted during off hours once a week when the database is nearly idle.
One final note on this strategy. There are three types of database connectors for Hadoop: generic JDBC connectors, free (but not open source) vendor-specific connectors, and licensable vendor specific connectors. Performance follows in order with the licensable vendor specific connector being the most performant. Depending on database size, this could be a reasonable investment.
However, most customers use Apache Sqoop for data exports with either the generic JDBC connector or the free vendor-specific connector. When performing a periodic full data export, we strongly recommend that you use either the paid or free database specific connector and that the target table is a table or materialized view.
Strategy #3: Perform incremental merges
If the data cannot be completely reloaded, an incremental merge is a time tested strategy which has been implemented at thousands of locations. In fact, if the data cannot be completely reloaded, an incremental merge is phData’s recommended strategy. Hortonworks has an excellent implementation guide. While this strategy is more work than the “ACID” feature described in the following section, it’s been tested over time and has well known performance and operational characteristics.
Strategy #4: Use Apache Hive 0.14’s “ACID” feature
The latest version of Apache Hive, 0.14, has added a feature titled “ACID,” which provides the ability to insert single, update, and delete rows. The term “ACID” has been quoted throughout this post because it’s important to distinguish between what this feature provides and what an OLTP or a transactional OLAP database provides. For example, performing a single insert, update, or delete once per second would result in a Hive “ACID” system falling on its face. A more user friendly name for this feature might be “bulk update”.
In the long term, this feature may provide an easy and performant method of performing updates to Hive tables. Today, the feature is not complete. For example, today each statement is auto-committed and there is no ability to start a transaction, update records by individual update statements, and then commit the transaction.
This feature works by first assigning each row a row ID and then when rows are updated, storing the updates in delta files. When data is read, Hive merges in the appropriate delta files and picks the latest version of a row. Updating a few thousand records as 50,000 individual update statements would result in an explosion of “delta” files and perform terribly. Under the current implementation, those 50k updates would have to be coalesced into as few update statements as possible or be issued over a long period of time to allow compactions to keep up. If compactions cannot keep up with the rate of inserts, performance will suffer as each read will have to merge many delta files.
Again, as this feature is brand new, users should approach Hive “ACID” cautiously.