Hadoop

Getting Started with Kudu

Five years ago, enabling Data Science and Advanced Analytics on the Hadoop platform was hard. Organizations required strong Software Engineering capabilities to successfully implement complex Lambda architectures, or even just continuous ingest. Updating or deleting data was simply a nightmare. The General Data Protection Regulation (GDPR) would have been an extreme challenge at that time. […]

Read More

Parquet vs Text Compression

Parquet is a columnar data format. Columnar formats store data grouped by column; when tuned specifically for a given dataset they can achieve compression ratios of up to 95%, and even with zero tuning they still provide excellent compression. For example, below I use Faker to generate 1M rows: from faker import Factory fake = […]
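The Faker snippet above is truncated, but the core idea — similar values stored contiguously compress far better — can be shown with a self-contained, stdlib-only sketch. Here synthetic rows and zlib stand in for Faker and Parquet's real encoders (the column names and sizes are made up for illustration):

```python
import csv, io, random, zlib

random.seed(42)

# Synthetic dataset with low-cardinality columns, similar in spirit
# to the Faker-generated rows (these values are illustrative).
states = ["MN", "WI", "IA", "IL", "SD"]
rows = [(f"user{i}", random.choice(states), random.randint(18, 90))
        for i in range(100_000)]

# Row-oriented layout: values from different columns interleave.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
row_bytes = buf.getvalue().encode()

# Column-oriented layout: each column's values stored contiguously,
# so runs of similar values compress much better.
col_bytes = b"".join(
    "\n".join(str(v) for v in col).encode() for col in zip(*rows))

row_ratio = len(zlib.compress(row_bytes)) / len(row_bytes)
col_ratio = len(zlib.compress(col_bytes)) / len(col_bytes)
print(f"row-wise ratio: {row_ratio:.2f}, columnar ratio: {col_ratio:.2f}")
```

On this kind of data the columnar layout compresses to a noticeably smaller fraction of its original size than the row-wise layout; Parquet's dictionary and run-length encodings widen the gap further.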

Read More

Troubleshooting Spark and Kafka escalations for our managed services customers

My favorite technical task is handling a troubleshooting escalation from one of our Managed Services customers. This week I was lucky enough to work on a problem worth sharing. A Spark job was producing messages to Kafka. Halfway through, the job froze and no more messages were produced to Kafka. After […]

Read More

StreamSets – Hadoop Ingestion Made Simple

StreamSets recently announced and open sourced their first product, DataCollector. I had been given access to a preview version of the product and was quite impressed. Given their product is now public and generally available, I thought I would go through a super-simple demo. In my consulting role at phData, I’ve worked with many customers […]

Read More

Real-time Analytics on Medical Device Data – Part 3 – Schema

This is the third post in a series on real-time analytics on medical device data. In our previous post on infrastructure, we covered two methods of partitioning our dataset. As discussed, rather than partitioning by date, we will partition by hashed patient ID. Partitioning by patient ID allows us to quickly scan all records for a […]
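A minimal sketch of the hashed-patient-ID idea, assuming a fixed bucket count (the bucket count, hash choice, and ID format below are illustrative, not from the post):

```python
import hashlib

NUM_BUCKETS = 16  # illustrative partition count, not from the post

def bucket_for(patient_id: str) -> int:
    """Map a patient ID to a stable hash bucket (partition)."""
    digest = hashlib.md5(patient_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

# All records for one patient land in the same partition,
# so a per-patient scan touches a single bucket ...
assert bucket_for("patient-123") == bucket_for("patient-123")

# ... while many patients spread roughly evenly across buckets,
# avoiding the write hot-spotting that date partitioning causes.
buckets = {bucket_for(f"patient-{i}") for i in range(1000)}
print(f"buckets used: {len(buckets)} of {NUM_BUCKETS}")
```

The stable hash is what matters: the same patient always routes to the same partition, while inserts for different patients fan out across all buckets.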

Read More

Apache Kafka Performance Numbers

A search for “Apache Kafka performance” will return dozens of articles, but few results useful for estimating real-world performance. Specifically, I’ve found few results that run on hardware common to modern data centers, replicate the data with the common factor of 3, and use many parallel producers and consumers. These results are meant to be used […]

Read More

Data Redaction in Hadoop

One of the growing trends we continue to see is concern around properly handling personally identifiable information (PII): Social Security numbers, credit card numbers, passport numbers, etc. It’s always been there, but with the recent large exposures, every company is going back to make sure data is being handled properly. Back in a former life, […]
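As a toy illustration of what redacting those fields can look like (the patterns and labels below are simplified, hypothetical examples, not the approach the post describes, and not production-grade PII detectors):

```python
import re

# Hypothetical redaction pass over free text. Real deployments use
# platform-level redaction policies, not ad hoc regexes like these.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched PII with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label}]", text)
    return text

print(redact("SSN 123-45-6789, card 4111 1111 1111 1111"))
```

The point is the shape of the approach — detect, then replace with a label that preserves context for downstream users — rather than the specific patterns.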

Read More

Rolling Hadoop Upgrades

For those of us working in Hadoop operations, Hortonworks has a great read on Hadoop’s evolution into an ever-breathing organism. In other words, as Hadoop continues to grow into the backbone of data centers, having it up 24/7/365 is a must. The article goes through the process of rolling upgrades. One of the critical […]

Read More

How Many Hadoop Clusters Should A Company Have?

One of the questions we get asked is “How many Hadoop clusters should we have?” And like all good technology answers, we generally respond with “It depends.” That being said, here are a few general rules we’ve seen applied across enterprise organizations. PROD vs non-PROD – Many organizations physically separate PROD and non-PROD infrastructure with […]

Read More