Blog

Getting Into The Cloud With CDH In Minutes

A few weeks ago we wrote an article on the pros and cons of running your Hadoop capabilities in the cloud compared to on-premises. The conclusion was that there isn’t a right or wrong answer. If you’re investing in data centers, it probably makes sense to run Hadoop on-premises. If you’re not investing in data […]

Read More

Example Self Contained Spark Application

For the past few weeks we’ve shown some simple examples of how to use Hive and Impala with different file formats, along with partitioning. These approaches have exercised two very popular interfaces into Hadoop – namely Map/Reduce with Hive, and Impala. We’re now shifting gears to introduce a new up-and-comer – namely Spark. […]

Read More

Hands On Example With Hive Partitioning

Building off our Simple Examples Series, we wanted to take five minutes and show you how to recognize the power of partitioning. For a more detailed article on partitioning, Cloudera has a nice blog write-up with some pointers: http://blog.cloudera.com/blog/2014/08/improving-query-performance-using-partitioning-in-apache-hive/ One of the pointers that should resonate is the cardinality of the column, which is another […]
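To make the idea concrete before you dive into the full post, here is a minimal Python sketch (not Hive itself; the data and the `region` partition key are invented for illustration) of why partitioning on a low-cardinality column pays off: a filter on the partition column only has to touch one bucket, the way Hive prunes whole partition directories.

```python
from collections import defaultdict

# Toy dataset of (region, amount) rows. "region" is a good partition
# key because it has low cardinality: only a handful of distinct values.
rows = [("US", 100), ("EU", 250), ("US", 75), ("APAC", 40), ("EU", 10)]

# "Partitioning": group rows into one bucket per region, analogous to
# Hive writing one directory per partition value.
partitions = defaultdict(list)
for region, amount in rows:
    partitions[region].append(amount)

# A query filtered on the partition column reads a single bucket
# (partition pruning) instead of scanning every row.
def total_for(region):
    return sum(partitions.get(region, []))
```

A high-cardinality key (say, a timestamp) would instead produce one near-empty bucket per row, which is exactly the pathology the Cloudera post warns about.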

Read More

Binary Stream Ingest: Flume vs Kafka vs Kinesis

Introduction The internet of things will put new demands on Hadoop ingest methods, specifically on their ability to capture raw sensor data — binary streams. As discussed, big data will remove previous data storage constraints and allow streaming of raw sensor data at granularities dictated by the sensors themselves. The focus of this post will […]

Read More

Examples Using AVRO and ORC with Hive and Impala

Building off our first post on TEXTFILE and PARQUET, we decided to show examples with AVRO and ORC. AVRO is a row-oriented format, while Optimized Row Columnar (ORC) is a format tailored to perform well in Hive. These were executed on CDH 5.2.0 running Hive 0.13.1 + Cloudera back ports. NOTE: These first few […]
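The row-oriented vs. columnar distinction above can be pictured with a toy Python sketch (plain lists and dicts standing in for AVRO and ORC files; the records are invented): in a columnar layout, a query that only needs one column reads one contiguous list and skips the rest.

```python
# Toy records, as a Hive table might hold them.
records = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 2, "name": "b", "score": 20},
    {"id": 3, "name": "c", "score": 30},
]

# Row-oriented layout (AVRO-like): each record's fields stored together.
row_store = [(r["id"], r["name"], r["score"]) for r in records]

# Columnar layout (ORC-like): all values of one column stored together.
col_store = {key: [r[key] for r in records] for key in records[0]}

# Summing one column touches a single list in the columnar layout...
total = sum(col_store["score"])
# ...but must walk every record in the row layout.
total_rows = sum(fields[2] for fields in row_store)
```

This is why analytic scans tend to favor ORC/PARQUET, while write-heavy, whole-record workloads often suit a row format like AVRO.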

Read More

Build a Hadoop Distribution like Hortonworks or Cloudera

Apache Hadoop and much of its ecosystem are free and open source. Due to its free nature, customers often ask, why would I need a distribution such as Cloudera, Hortonworks, or MapR? Indeed, some users of Hadoop do not use a distribution. Yahoo and Facebook, for example, build a distribution for internal use. The purpose […]

Read More

Examples Using TEXTFILE and PARQUET with Hive and Impala

phData is a fan of simple examples. With that mindset, here is a very quick way for you to get some hands-on experience seeing the differences between TEXTFILE and PARQUET, along with Hive and Impala. You can do this on a cluster of your own, or use Cloudera’s Quick Start VM. Our steps […]

Read More

Hadoop Versions In Vendor Distributions

Hadoop distributions typically come with between 20 and 30 open source projects, all bundled together to make a “big data platform” enterprises can deploy and maintain in a sustained manner — e.g. Hadoop core, HBase, Hive, Spark, etc. The foundation is Hadoop core, with the others sitting alongside or on top. Two common questions come up with enterprises […]

Read More

4 Strategies for Updating Hive Tables

Apache Hive and complementary technologies such as Cloudera Impala provide scalable SQL on Apache Hadoop. Unlike legacy database systems, Hive and Impala have traditionally not provided any update functionality. However, many use cases require periodically updating rows, such as slowly changing dimension tables. SQL on Hadoop technologies typically utilize one of two storage engines, Apache HBase […]
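One common strategy when the storage layer offers no in-place UPDATE (e.g. Hive tables on HDFS) is to merge the base table with a batch of changed rows and rewrite the result, keeping the newest version of each key. A toy Python sketch of that merge step, with an assumed `(key, value, version)` row shape and invented data:

```python
base = [(1, "alice", 1), (2, "bob", 1), (3, "carol", 1)]
updates = [(2, "bobby", 2), (4, "dave", 2)]

def merge(base, updates):
    """Keep the highest-version row per key, then rewrite the table."""
    latest = {}
    for key, value, version in base + updates:
        if key not in latest or version > latest[key][2]:
            latest[key] = (key, value, version)
    return sorted(latest.values())
```

In Hive terms this is roughly a full outer join of the base table against the update batch with an `INSERT OVERWRITE` of the winner rows; the full post covers this and alternative strategies.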

Read More

The Paradox of Agile Data Management

At phData, many of us come from a software development background and have witnessed the success of Agile methodologies. Agile started in software development, where it quickly gained popularity, but has also now made inroads into other realms. The concept of the “Agile Admin”, or as it’s better known, DevOps, takes many of its core […]

Read More