February 5, 2015

Examples Using AVRO and ORC with Hive and Impala

By Brock Noland

Building off our first post on TEXTFILE and PARQUET, we decided to show examples with AVRO and ORC. AVRO is a row-oriented format, while Optimized Row Columnar (ORC) is a columnar format tailored to perform well in Hive. These examples were executed on CDH 5.2.0 running Hive 0.13.1 plus Cloudera backports.
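The practical difference between the two layouts is what ends up adjacent on disk. Here is a toy illustration in Python (not how Avro or ORC actually encode data, just the idea of row versus column adjacency):

```python
# Toy illustration of row-oriented vs. columnar layout.
# (Not how Avro or ORC actually encode data.)
records = [
    ("CA", 18.3),
    ("MN", -2.1),
    ("TX", 25.0),
]

# Row-oriented: each record's fields are stored together,
# which favors writing and reading whole records.
row_layout = [field for record in records for field in record]

# Columnar: all values of one column are stored together,
# so a scan of only the "degrees" column touches a contiguous slice.
col_layout = [r[0] for r in records] + [r[1] for r in records]

print(row_layout)  # ['CA', 18.3, 'MN', -2.1, 'TX', 25.0]
print(col_layout)  # ['CA', 'MN', 'TX', 18.3, -2.1, 25.0]
```

This is why ORC tends to win on analytical scans that touch a few columns, while row formats like Avro suit record-at-a-time reads and writes.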

NOTE: These first few steps can be skipped if you did them in our previous example.

Experience the Differences Between AVRO and ORC

First, let’s gather some data. In this example, we grabbed temperature data from the US government. We took the ‘hourly_TEMP_2014.zip’ data, which after being uncompressed, is around 1.1GB. Not large by any means, but enough to use in this example.

#unzip, efficiently remove the header, and then add to HDFS

unzip hourly_TEMP_2014.zip

tail -n +2 hourly_TEMP_2014.csv > hourly_TEMP_2014.csv.tmp && mv -f hourly_TEMP_2014.csv.tmp hourly_TEMP_2014.csv

gzip hourly_TEMP_2014.csv

hadoop fs -copyFromLocal hourly_TEMP_2014.csv.gz /tmp

 

#text – load some initial data (via beeline or Hue)

create table temps_txt (statecode string, countrycode string, sitenum string, paramcode string, poc string, latitude string, longitude string, datum string, param string, datelocal string, timelocal string, dategmt string, timegmt string, degrees double, uom string, mdl string, uncert string, qual string, method string, methodname string, state string, county string, dateoflastchange string) row format delimited fields terminated by ',';

load data inpath '/tmp/hourly_TEMP_2014.csv.gz' into table temps_txt;

 

Notice how I'm loading a gzipped file. This is a nice feature of the "load data" command. Here is some light reading on compression loads.
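The load works because "load data" moves the file into the table's directory as-is, and the text input format picks a decompression codec from the file extension at read time. A toy Python sketch of that extension-based dispatch (an illustration only, not Hive's actual code path):

```python
import gzip
import os
import tempfile

def open_text(path):
    """Pick a reader based on file extension, the way Hive's
    text input format picks a codec for .gz files."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)

# Write a small gzipped "CSV" and read it back transparently.
path = os.path.join(tempfile.gettempdir(), "toy_temps.csv.gz")
with gzip.open(path, "wt") as f:
    f.write("CA,18.3\nMN,-2.1\n")

with open_text(path) as f:
    rows = [line.strip().split(",") for line in f]

print(rows)  # [['CA', '18.3'], ['MN', '-2.1']]
os.remove(path)
```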

#orc – create and load some initial data (via beeline or Hue)

create table temps_orc (statecode string, countrycode string, sitenum string, paramcode string, poc string, latitude string, longitude string, datum string, param string, datelocal string, timelocal string, dategmt string, timegmt string, degrees double, uom string, mdl string, uncert string, qual string, method string, methodname string, state string, county string, dateoflastchange string) stored as orc;

insert into table temps_orc select * from temps_txt;

 

#avro – create and load some initial data (via beeline or Hue)

create table temps_avr

stored as avro

tblproperties ('avro.schema.literal'='{
"name": "temps",
"type": "record",
"fields": [
{"name":"statecode", "type":"string"},
{"name":"countrycode", "type":"string"},
{"name":"sitenum", "type":"string"},
{"name":"paramcode", "type":"string"},
{"name":"poc", "type":"string"},
{"name":"latitude", "type":"string"},
{"name":"longitude", "type":"string"},
{"name":"datum", "type":"string"},
{"name":"param", "type":"string"},
{"name":"datelocal", "type":"string"},
{"name":"timelocal", "type":"string"},
{"name":"dategmt", "type":"string"},
{"name":"timegmt", "type":"string"},
{"name":"degrees", "type":"double"},
{"name":"uom", "type":"string"},
{"name":"mdl", "type":"string"},
{"name":"uncert", "type":"string"},
{"name":"qual", "type":"string"},
{"name":"method", "type":"string"},
{"name":"methodname", "type":"string"},
{"name":"state", "type":"string"},
{"name":"county", "type":"string"},
{"name":"dateoflastchange", "type":"string"}
]}');

insert into table temps_avr select * from temps_txt;
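Because a typo inside the avro.schema.literal string only surfaces when Hive parses the table definition, it can be worth sanity-checking the JSON first. A minimal sketch in Python (the literal is abridged to three fields here; paste the full schema from the DDL above in practice, and note this only checks JSON shape, not full Avro validity):

```python
import json

# Abridged avro.schema.literal value from the DDL above.
schema_literal = """{
"name": "temps",
"type": "record",
"fields": [
{"name":"statecode", "type":"string"},
{"name":"degrees", "type":"double"},
{"name":"dateoflastchange", "type":"string"}
]}"""

schema = json.loads(schema_literal)  # raises ValueError on bad JSON
assert schema["type"] == "record"

names = [f["name"] for f in schema["fields"]]
assert len(names) == len(set(names)), "duplicate field names"
print(names)
```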

 

#queries via hive (beeline or Hue)

select * from temps_orc where state == '"California"' and dategmt == '"2014-09-24"' and timegmt == '"08:00"';

select * from temps_avr where state == '"California"' and dategmt == '"2014-09-24"' and timegmt == '"08:00"';
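The doubled-up quotes in those predicates ('"California"', quotes included) come from the source data: the table was declared as plain delimited text, and splitting on commas keeps the double quotes that wrap each field in the CSV. A quick Python sketch of the effect:

```python
# A line from a CSV file that quotes every field:
line = '"California","Alameda","08:00"'

# A plain split on commas (effectively what a delimited-text table
# does here, with no CSV quote handling) keeps the double quotes
# as part of each value:
fields = line.split(",")
print(fields)  # ['"California"', '"Alameda"', '"08:00"']

# Hence the Hive predicates must compare against '"California"',
# double quotes and all.
```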

 

#aggregates via impala (impala-shell)

invalidate metadata;

select max(degrees), state, county from temps_avr group by state, county;

 

Trick question: why didn't we show an example of ORC with Impala? Here is your answer. To learn more, visit our Cloudera services page to see how phData can help your organization.
