January 27, 2015

Examples Using Textfile and Parquet with Hive and Impala

By Brock Noland

Experience the differences between TEXTFILE, PARQUET, Hive and Impala

phData is a fan of simple examples. With that mindset, here is a very quick way to get some hands-on experience with the differences between TEXTFILE and PARQUET storage, and between Hive and Impala as query engines. You can follow along on a cluster of your own, or use Cloudera’s Quick Start VM. Our steps were run on a three-node CDH 5.2.0 cluster with Hive 0.13.1 and Impala 2.0.1.

First, gather some data. In this example, we grabbed temperature data from the US government. We took the ‘hourly_TEMP_2014.zip’ file, which is around 1 GB once uncompressed. Not large by any means, but enough for this example.

#unzip, efficiently strip the header row from the file, and add it to hdfs.

unzip hourly_TEMP_2014.zip

tail -n +2 hourly_TEMP_2014.csv > hourly_TEMP_2014.csv.tmp && mv -f hourly_TEMP_2014.csv.tmp hourly_TEMP_2014.csv

hadoop fs -copyFromLocal hourly_TEMP_2014.csv /tmp
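
To sanity-check the upload, you can list the file in HDFS (this step isn’t strictly required; the -h flag just prints a human-readable size):

#optional: confirm the file landed in hdfs

hadoop fs -ls -h /tmp/hourly_TEMP_2014.csv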

Next, log into Hive (via beeline or Hue), create the tables, and load some data. In this example, we’re creating a TEXTFILE table and a PARQUET table. PARQUET is a columnar format that gives us advantages for storing and scanning data. Storing the data column-wise allows for better compression, which gives us faster scans while using less storage (the on-disk size comparison after the load below makes this concrete). It’s also helpful for “wide” tables and for column-level aggregations (e.g., avg(degrees)).

#hive (via the beeline shell or Hue)

create table temps_txt (statecode string, countrycode string, sitenum string, paramcode string, poc string, latitude string, longitude string, datum string, param string, datelocal string, timelocal string, dategmt string, timegmt string, degrees double, uom string, mdl string, uncert string, qual string, method string, methodname string, state string, county string, dateoflastchange string) row format delimited fields terminated by ',';

load data inpath '/tmp/hourly_TEMP_2014.csv' into table temps_txt;

create table temps_par (statecode string, countrycode string, sitenum string, paramcode string, poc string, latitude string, longitude string, datum string, param string, datelocal string, timelocal string, dategmt string, timegmt string, degrees double, uom string, mdl string, uncert string, qual string, method string, methodname string, state string, county string, dateoflastchange string) stored as parquet;

insert into table temps_par select * from temps_txt;
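
A quick way to see the columnar format’s compression benefit is to compare the on-disk size of the two tables. This sketch assumes the default Hive warehouse location of /user/hive/warehouse; your cluster may be configured differently:

#compare on-disk table sizes (assumes the default warehouse path)

hadoop fs -du -s -h /user/hive/warehouse/temps_txt

hadoop fs -du -s -h /user/hive/warehouse/temps_par

The Parquet copy should come in noticeably smaller than the delimited text.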

Now that we have some data, let’s do some analysis. In these examples, we use Hive to run the same aggregate against the TEXTFILE and PARQUET tables. Your results will vary, but with these statements the PARQUET queries should be faster because of Parquet’s columnar storage.

#hive (via the beeline shell or Hue)

select avg(degrees) from temps_txt;

select avg(degrees) from temps_par;
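
If you’re curious how Hive plans these queries, prefixing one with explain prints the execution plan without running it:

#hive: inspect a query plan without executing it

explain select avg(degrees) from temps_par;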

Now, let’s launch the impala-shell and issue the same queries. Impala uses the Hive metastore, so it can query the same tables you created in Hive.

#impala (via impala-shell)

invalidate metadata; -- needed so Impala sees the tables created in Hive

select avg(degrees) from temps_txt;

select avg(degrees) from temps_par;
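
impala-shell also lets you see where the time went: right after a query finishes, the summary and profile commands print the execution summary and the full runtime profile of the most recent query.

#impala: inspect the last query's execution

summary;

profile;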

Again, these are very small data sets for Hadoop, but they give a simple example of how to get up and running so you can see the differences between storage formats (TEXTFILE vs. PARQUET) and query engine behavior (Hive vs. Impala).

Before we conclude, a note on bias: these examples are not intended to favor any one file format or query engine. There are good use cases for all of the tooling discussed. For example, you will likely notice that Impala is faster than Hive in these simple examples, but query performance is only one aspect to consider. The community sees a rich future for both query engines.

To learn more, visit our Cloudera services page to see how phData can help your organization.
