February 5, 2015

Examples Using AVRO and ORC with Hive and Impala

By Brock Noland

Building off our first post on TEXTFILE and PARQUET, we decided to show examples with AVRO and ORC. AVRO is a row-oriented format, while Optimized Row Columnar (ORC) is a columnar format tailored to perform well in Hive. These examples were executed on CDH 5.2.0 running Hive 0.13.1 plus Cloudera backports.
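The practical difference between the two layouts is what ends up adjacent on disk. Here is a toy illustration in Python (not how Avro or ORC actually encode data, just the idea of row versus column adjacency):

```python
# Toy illustration of row-oriented vs. columnar layout.
# (Not how Avro or ORC actually encode data.)
records = [
    ("CA", 18.3),
    ("MN", -2.1),
    ("TX", 25.0),
]

# Row-oriented: each record's fields are stored together,
# which favors writing and reading whole records.
row_layout = [field for record in records for field in record]

# Columnar: all values of one column are stored together,
# so a scan of only the "degrees" column touches a contiguous slice.
col_layout = [r[0] for r in records] + [r[1] for r in records]

print(row_layout)  # ['CA', 18.3, 'MN', -2.1, 'TX', 25.0]
print(col_layout)  # ['CA', 'MN', 'TX', 18.3, -2.1, 25.0]
```

This is why ORC tends to win on analytical scans that touch a few columns, while row formats like Avro suit record-at-a-time reads and writes.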

NOTE: These first few steps can be skipped if you did them in our previous example.

Experience the Differences Between AVRO and ORC

First, let’s gather some data. In this example, we grabbed temperature data from the US government. We took the ‘hourly_TEMP_2014.zip’ data, which after being uncompressed, is around 1.1GB. Not large by any means, but enough to use in this example.

#unzip, efficiently remove the header, and then add to HDFS

unzip hourly_TEMP_2014.zip

tail -n +2 hourly_TEMP_2014.csv > hourly_TEMP_2014.csv.tmp && mv -f hourly_TEMP_2014.csv.tmp hourly_TEMP_2014.csv

gzip hourly_TEMP_2014.csv

hadoop fs -copyFromLocal hourly_TEMP_2014.csv.gz /tmp

 

#text – load some initial data (via beeline or Hue)

create table temps_txt (statecode string, countrycode string, sitenum string, paramcode string, poc string, latitude string, longitude string, datum string, param string, datelocal string, timelocal string, dategmt string, timegmt string, degrees double, uom string, mdl string, uncert string, qual string, method string, methodname string, state string, county string, dateoflastchange string) row format delimited fields terminated by ',';

load data inpath '/tmp/hourly_TEMP_2014.csv.gz' into table temps_txt;

 

Notice how I'm loading a gzipped file. This is a nice feature of the "load data" command. Here is some light reading on compression loads.
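The load works because "load data" moves the file into the table's directory as-is, and the text input format picks a decompression codec from the file extension at read time. A toy Python sketch of that extension-based dispatch (an illustration only, not Hive's actual code path):

```python
import gzip
import os
import tempfile

def open_text(path):
    """Pick a reader based on file extension, the way Hive's
    text input format picks a codec for .gz files."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)

# Write a small gzipped "CSV" and read it back transparently.
path = os.path.join(tempfile.gettempdir(), "toy_temps.csv.gz")
with gzip.open(path, "wt") as f:
    f.write("CA,18.3\nMN,-2.1\n")

with open_text(path) as f:
    rows = [line.strip().split(",") for line in f]

print(rows)  # [['CA', '18.3'], ['MN', '-2.1']]
os.remove(path)
```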

#orc – create and load some initial data (via beeline or Hue)

create table temps_orc (statecode string, countrycode string, sitenum string, paramcode string, poc string, latitude string, longitude string, datum string, param string, datelocal string, timelocal string, dategmt string, timegmt string, degrees double, uom string, mdl string, uncert string, qual string, method string, methodname string, state string, county string, dateoflastchange string) stored as orc;

insert into table temps_orc select * from temps_txt;

 

#avro – create and load some initial data (via beeline or Hue)

create table temps_avr

stored as avro

tblproperties ('avro.schema.literal'='{
"name": "temps",
"type": "record",
"fields": [
{"name":"statecode", "type":"string"},
{"name":"countrycode", "type":"string"},
{"name":"sitenum", "type":"string"},
{"name":"paramcode", "type":"string"},
{"name":"poc", "type":"string"},
{"name":"latitude", "type":"string"},
{"name":"longitude", "type":"string"},
{"name":"datum", "type":"string"},
{"name":"param", "type":"string"},
{"name":"datelocal", "type":"string"},
{"name":"timelocal", "type":"string"},
{"name":"dategmt", "type":"string"},
{"name":"timegmt", "type":"string"},
{"name":"degrees", "type":"double"},
{"name":"uom", "type":"string"},
{"name":"mdl", "type":"string"},
{"name":"uncert", "type":"string"},
{"name":"qual", "type":"string"},
{"name":"method", "type":"string"},
{"name":"methodname", "type":"string"},
{"name":"state", "type":"string"},
{"name":"county", "type":"string"},
{"name":"dateoflastchange", "type":"string"}
]}');

insert into table temps_avr select * from temps_txt;
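Because a typo inside the avro.schema.literal string only surfaces when Hive parses the table definition, it can be worth sanity-checking the JSON first. A minimal sketch in Python (the literal is abridged to three fields here; paste the full schema from the DDL above in practice, and note this only checks JSON shape, not full Avro validity):

```python
import json

# Abridged avro.schema.literal value from the DDL above.
schema_literal = """{
"name": "temps",
"type": "record",
"fields": [
{"name":"statecode", "type":"string"},
{"name":"degrees", "type":"double"},
{"name":"dateoflastchange", "type":"string"}
]}"""

schema = json.loads(schema_literal)  # raises ValueError on bad JSON
assert schema["type"] == "record"

names = [f["name"] for f in schema["fields"]]
assert len(names) == len(set(names)), "duplicate field names"
print(names)
```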

 

#queries via hive (beeline or Hue)

select * from temps_orc where state == '"California"' and dategmt == '"2014-09-24"' and timegmt == '"08:00"';

select * from temps_avr where state == '"California"' and dategmt == '"2014-09-24"' and timegmt == '"08:00"';
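The doubled-up quotes in those predicates ('"California"', quotes included) come from the source data: the table was declared as plain delimited text, and splitting on commas keeps the double quotes that wrap each field in the CSV. A quick Python sketch of the effect:

```python
# A line from a CSV file that quotes every field:
line = '"California","Alameda","08:00"'

# A plain split on commas (effectively what a delimited-text table
# does here, with no CSV quote handling) keeps the double quotes
# as part of each value:
fields = line.split(",")
print(fields)  # ['"California"', '"Alameda"', '"08:00"']

# Hence the Hive predicates must compare against '"California"',
# double quotes and all.
```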

 

#aggregates via impala (impala-shell)

invalidate metadata;

select max(degrees), state, county from temps_avr group by state, county;

 

Trick question: why didn't we show an example of ORC with Impala? Here is your answer. To learn more, visit our Cloudera services page to see how phData can help your organization.
