Blog

Configuring Oozie for Spark SQL on a Secure Hadoop Cluster

A secure Hadoop cluster requires Oozie actions to be authenticated. However, due to the way Oozie workflows execute actions, Kerberos credentials are not available to actions launched by Oozie. Oozie runs actions on the Hadoop cluster; specifically, for legacy reasons, each action is started inside a single-task, map-only MapReduce job. Spark does […]

Read More

Parquet vs Text Compression

Parquet is a columnar data format. Columnar formats, which store data grouped by column, can achieve compression ratios of up to 95% when tuned for a given dataset; even with zero tuning, they still provide excellent compression. For example, below I use Faker to generate 1M rows: from faker import Factory fake = […]
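
As a rough sketch of where that excerpt is headed (assuming pandas and pyarrow are installed alongside Faker; the column names and output paths are my own illustration, not the post's), the comparison might look like:

    # Generate fake rows, write them as plain text (CSV) and as Parquet,
    # then compare on-disk sizes. Row count and columns are illustrative.
    import os

    import pandas as pd
    from faker import Factory

    fake = Factory.create()
    rows = [{"name": fake.name(), "address": fake.address()}
            for _ in range(1_000_000)]
    df = pd.DataFrame(rows)

    df.to_csv("rows.csv", index=False)   # uncompressed text baseline
    df.to_parquet("rows.parquet")        # columnar; compressed by default

    for path in ("rows.csv", "rows.parquet"):
        print(path, os.path.getsize(path), "bytes")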

Read More

Spark Job History Server OutOfMemoryError

One of our customers hit an issue where the Spark Job History Server was running out of memory every few hours. The heap size was set to 4GB, and the customer was not a heavy user of Spark, submitting no more than a couple of jobs a day. We did notice that they had many long-running spark-shell […]

Read More

try-with-resources in Scala

At phData, many of our customers are deploying Apache Spark. As such, those customers look to phData not only for Apache Spark, Hadoop, and Kafka expertise, but also for Scala. The following is an explanation of a construct we use in one of our Scala and Spark courses. This post will show a generic function that is […]
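
The construct itself is Scala, but as a loose Python analogue of the same loan pattern (a hypothetical helper, not the post's actual code), the shape is roughly:

    # Loose analogue of the loan pattern: acquire a resource, hand it to a
    # function, and close it whether or not that function throws.
    # This helper is hypothetical; the post builds the Scala equivalent.
    def using(resource, body):
        try:
            return body(resource)
        finally:
            resource.close()

    # Usage (assumes data.txt exists): the file is closed even if the
    # lambda raises.
    contents = using(open("data.txt"), lambda f: f.read())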

Read More

Troubleshooting Spark and Kafka escalations for our managed services customers

My favorite technical task is handling a troubleshooting escalation from one of our Managed Services customers. This week I was lucky enough to work on a problem worth sharing. The job was a Spark job producing messages to Kafka. Halfway through, the job froze and no more messages were produced to Kafka. After […]

Read More

Hive corruption due to newlines and carriage returns

phData has customers across the spectrum of use cases. One of our customers stores vast volumes of XML. One of our engineers was recently asked: "Hive sometimes corrupts my data and other times it does not. What is going on?" The answer is quite interesting, so I thought I would share. Specifically, the query they […]
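
The short version of the mechanism, as a minimal sketch (assuming a default text-format table, where a newline delimits rows): a field value containing an embedded newline is split into two malformed rows on read.

    # One logical record whose XML field contains an embedded newline.
    record = "id1\t<note>line one\nline two</note>"

    with open("table_data.txt", "w") as f:
        f.write(record + "\n")

    # A line-oriented reader (like a text-backed Hive table's input)
    # treats every '\n' as a row boundary.
    with open("table_data.txt") as f:
        rows = f.read().splitlines()

    print(len(rows))  # 2 -- the single record now parses as two broken rows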

Read More

Out of my comfort zone, but growing

“If we’re growing, we’re always going to be out of our comfort zone.” – John Maxwell My phData internship started with me getting lost at St. Thomas. Although I eventually found my way, it was a new chapter in my life, since I didn’t have any background in Computer Science. Not having a background in the […]

Read More

StreamSets – Hadoop Ingestion Made Simple

StreamSets recently announced and open-sourced their first product, DataCollector. I had been given access to a preview version of the product and was quite impressed. Now that their product is public and generally available, I thought I would go through a super-simple demo. In my consulting role at phData, I’ve worked with many customers […]

Read More

Real-time Analytics on Medical Device Data – Part 3 – Schema

This is the third post in a series on real-time analytics on medical device data. In our previous post on infrastructure, we covered two methods of partitioning our dataset. As discussed, rather than partitioning by date, we will partition by hashed patient id. Partitioning by patient id allows us to quickly scan all records for a […]
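
As a minimal sketch of the idea (the bucket count and hash function here are my own illustration, not the series' actual values): hashing the patient id to a fixed bucket keeps all of a patient's records in one partition, so a per-patient scan touches a single bucket.

    import hashlib

    NUM_BUCKETS = 64  # illustrative; sized to the cluster and dataset

    def bucket_for(patient_id: str) -> int:
        # md5 is stable across processes and runs, unlike Python's
        # builtin hash(), so the same patient always maps to the
        # same bucket.
        digest = hashlib.md5(patient_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_BUCKETS

    print(bucket_for("patient-0042"))  # every record for this patient -> same bucket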

Read More