Examples Using AVRO and ORC with Hive and Impala


Building on our first post on TEXTFILE and PARQUET, we decided to show examples with AVRO and ORC. AVRO is a row-oriented format, while Optimized Row Columnar (ORC) is a columnar format tailored to perform well in Hive. These examples were executed on CDH 5.2.0 running Hive 0.13.1 plus Cloudera back ports.

NOTE: These first few steps can be skipped if you did them in our previous example.

First, let’s gather some data. In this example, we grabbed temperature data from the US government. We took the ‘hourly_TEMP_2014.zip’ data, which, after being uncompressed, is around 1.1 GB. Not large by any means, but enough to use in this example.

#unzip, efficiently remove the header, and then add to HDFS

unzip hourly_TEMP_2014.zip

tail -n +2 hourly_TEMP_2014.csv > hourly_TEMP_2014.csv.tmp && mv -f hourly_TEMP_2014.csv.tmp hourly_TEMP_2014.csv

gzip hourly_TEMP_2014.csv

hadoop fs -copyFromLocal hourly_TEMP_2014.csv.gz /tmp
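The header-removal idiom above (tail -n +2 plus a rename) is easy to sanity-check on a throwaway file before running it against the real data; sample.csv and its contents below are made up for illustration.

```shell
# Stand-in for the EPA file: one header row plus two data rows (made-up values).
printf '"State Code","County Code"\n"06","037"\n"06","075"\n' > sample.csv

# Same idiom as above: emit everything from line 2 onward, then rename.
tail -n +2 sample.csv > sample.csv.tmp && mv -f sample.csv.tmp sample.csv

# Only the two data rows remain.
wc -l < sample.csv
```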

 

#text – load some initial data (via beeline or Hue)

create table temps_txt (statecode string, countrycode string, sitenum string, paramcode string, poc string, latitude string, longitude string, datum string, param string, datelocal string, timelocal string, dategmt string, timegmt string, degrees double, uom string, mdl string, uncert string, qual string, method string, methodname string, state string, county string, dateoflastchange string) row format delimited fields terminated by ',';

load data inpath '/tmp/hourly_TEMP_2014.csv.gz' into table temps_txt;

 

Notice how I’m loading a gzipped file. This is a nice feature of the “load data” command. Here is some light reading on compression loads.
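For a quick feel for why this works: “load data” simply moves the file into the table’s directory, and Hive picks a decompression codec from the file extension at read time, so a .gz text file behaves like the plain file. A rough stand-alone analogy (demo.csv below is a made-up file):

```shell
# A tiny made-up delimited file, then compress it in place.
printf 'a,1\nb,2\n' > demo.csv
gzip demo.csv

# "load data" would move demo.csv.gz into the table directory unchanged;
# at query time Hive streams it through the gzip codec, much like:
gunzip -c demo.csv.gz
```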

#orc – create and load some initial data (via beeline or Hue)

create table temps_orc (statecode string, countrycode string, sitenum string, paramcode string, poc string, latitude string, longitude string, datum string, param string, datelocal string, timelocal string, dategmt string, timegmt string, degrees double, uom string, mdl string, uncert string, qual string, method string, methodname string, state string, county string, dateoflastchange string) stored as orc;

insert into table temps_orc select * from temps_txt;

 

#avro – create and load some initial data (via beeline or Hue)

create table temps_avr
stored as avro
tblproperties ('avro.schema.literal'='{
"name": "temps",
"type": "record",
"fields": [
{"name":"statecode", "type":"string"},
{"name":"countrycode", "type":"string"},
{"name":"sitenum", "type":"string"},
{"name":"paramcode", "type":"string"},
{"name":"poc", "type":"string"},
{"name":"latitude", "type":"string"},
{"name":"longitude", "type":"string"},
{"name":"datum", "type":"string"},
{"name":"param", "type":"string"},
{"name":"datelocal", "type":"string"},
{"name":"timelocal", "type":"string"},
{"name":"dategmt", "type":"string"},
{"name":"timegmt", "type":"string"},
{"name":"degrees", "type":"double"},
{"name":"uom", "type":"string"},
{"name":"mdl", "type":"string"},
{"name":"uncert", "type":"string"},
{"name":"qual", "type":"string"},
{"name":"method", "type":"string"},
{"name":"methodname", "type":"string"},
{"name":"state", "type":"string"},
{"name":"county", "type":"string"},
{"name":"dateoflastchange", "type":"string"}
]}');

insert into table temps_avr select * from temps_txt;

 

#queries via hive (beeline or Hue)

select * from temps_orc where state = '"California"' and dategmt = '"2014-09-24"' and timegmt = '"08:00"';

select * from temps_avr where state = '"California"' and dategmt = '"2014-09-24"' and timegmt = '"08:00"';
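Why the filters compare against '"California"' rather than 'California': the tables were declared with a plain comma delimiter, so Hive splits each line on commas without interpreting the CSV quoting, and the double quotes stay embedded in the stored values. A rough illustration of the difference (the sample line is made up):

```shell
line='"06","037","California"'

# What the delimited SerDe effectively does: split on commas, keep the quotes.
echo "$line" | cut -d, -f3

# What a quote-aware CSV parser would return for the same field.
echo "$line" | cut -d, -f3 | tr -d '"'
```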

 

#aggregates via impala (impala-shell)

invalidate metadata;

select max(degrees), state, county from temps_avr group by state, county;

 

Trick question: why didn’t we show an example of ORC with Impala? Here is your answer.
