Blog

Apache Kafka Performance Numbers

A search for “Apache Kafka performance” will return dozens of articles but few results useful for estimating real-world performance. Specifically, I’ve found few results that run on hardware common to modern data centers, replicate the data with the common factor of 3, and use many parallel producers and consumers. These results are meant to be used […]

Read More

Data Redaction in Hadoop

One of the growing trends we continue to see is concern around properly handling personally identifiable information (PII): Social Security numbers, credit card numbers, passport numbers, etc. It’s always been a concern, but with the recent large exposures, every company is going back to make sure data is being handled properly. Back in a former life, […]

Read More

Rolling Hadoop Upgrades

For those of us working in Hadoop operations, Hortonworks has a great read on Hadoop’s evolution into an ever-breathing organism. In other words, as Hadoop continues to grow into the backbone of data centers, having it up 24/7/365 is a must. The article goes through the process of rolling upgrades. One of the critical […]

Read More

How Many Hadoop Clusters Should A Company Have?

One of the questions we get asked is “How many Hadoop clusters should we have?” And like all good technology answers, we generally respond with “It depends.” That being said, here are a few general rules we’ve seen applied across enterprise organizations. PROD vs non-PROD – Many organizations physically separate PROD and non-PROD infrastructure with […]

Read More

Operational Notes on Apache Phoenix

We recently worked with a client who wanted to test Apache Phoenix on one of their pre-PROD HBase clusters. If you’re not familiar with Phoenix, it’s a SQL interface that sits nicely on top of HBase, ultimately making HBase look like a relational database. The group we were working with was focused on operations, so in […]

Read More

Data Center Replication with Accumulo 1.7.0

The Accumulo blog has a nice write-up explaining the data center replication feature coming in 1.7.0. Nearly every day we’re asked how data can be efficiently managed across geographically separated data centers. This was hard enough with small amounts of data, and it becomes ever more difficult when we’re talking terabytes or petabytes. They explain some […]

Read More

Introduction to Apache Spark GraphX – Part 2

As a follow-up to Introduction to Apache Spark GraphX – Part 1, we decided we’d traverse a real-world graph. As developers, we use GitHub, which hosts public git repositories. Caught up in the “social” buzz, GitHub has added social features. In addition to having a public profile which optionally lists your location, they allow you […]

Read More

Getting started with Apache Spark GraphX – Part 1

This is a two-part series on Apache Spark GraphX. The second part is located here. From DNA sequencing to anti-terrorism efforts, graph computation is increasingly ubiquitous in today’s world. Graphs have vertices and edges. The vertices are the entities in the graph, while the edges are the connections between those entities. Graphs can be undirected or directed. […]

Read More

Exploring Spark MLlib: Part 3 – Transformation and Model Creation

In the first two posts (1, 2) we ingested the data and explored it with the spark-shell. Now we’ll move on to creating and submitting our code as a standalone Spark application. Again, all the code covered in these posts can be found here. We’ll start by creating a case class and a function for parsing […]

Read More