Cloudera

Getting Started with Kudu

Five years ago, enabling Data Science and Advanced Analytics on the Hadoop platform was hard. Organizations required strong Software Engineering capabilities to successfully implement complex Lambda architectures, or even simply to implement continuous ingest. Updating or deleting data was simply a nightmare. Complying with the General Data Protection Regulation (GDPR) would have been an extreme challenge at that time. […]

Cloudera Altus – First Look

I was lucky enough to attend StrataEU 2017, and one of the sessions was “Deploying and managing Hive, Spark, and Impala in the public cloud”, led by Philip Langdale, Eugene Fratkin, and Jennifer Wu. I assumed this was a session on Cloudera Director, which we have lots of experience with, but I decided to pop my […]

Archiving Navigator Audit Data with StreamSets and Kafka

Andy Stadtler helped with this post. Many of phData’s customers are heavy users of Cloudera Navigator. Cloudera Navigator provides metadata information to the user, who can also audit all actions performed on data in the cluster. One customer generates an average of 4 GB of audit data per day, which is stored by default in the MySQL […]

Troubleshooting Spark and Kafka escalations for our managed services customers

My favorite technical task is handling a troubleshooting escalation from one of our Managed Services customers. This week I was lucky enough to work on a problem worth sharing. A Spark job was producing messages to Kafka. Halfway through, the job froze and no more messages were produced to Kafka. After […]

Data Redaction in Hadoop

One of the growing trends we continue to see is concern around properly handling personally identifiable information (PII): Social Security numbers, credit card numbers, passport numbers, etc. The concern has always been there, but with the recent large exposures, every company is going back to make sure data is being handled properly. Back in a former life, […]

Getting Into The Cloud With CDH In Minutes

A few weeks ago we wrote an article on the pros and cons of running your Hadoop capabilities in the cloud compared to on-premises. The conclusion was that there isn’t a right or wrong answer. If you’re investing in data centers, it probably makes sense to run Hadoop on-premises. If you’re not investing in data […]

Build a Hadoop Distribution like Hortonworks or Cloudera

Apache Hadoop and much of its ecosystem are free and open source. Because it is free, customers often ask: why would I need a distribution such as Cloudera, Hortonworks, or MapR? Indeed, some users of Hadoop do not use a vendor distribution. Yahoo and Facebook, for example, build a distribution for internal use. The purpose […]

Hadoop Versions In Vendor Distributions

Hadoop distributions typically come with between 20 and 30 open source projects (e.g. Hadoop core, HBase, Hive, Spark), all bundled together to make a “big data platform” that enterprises can deploy and maintain in a sustained manner. The foundation is Hadoop core, with the others sitting alongside or on top. Two common questions come up with enterprises […]

What Everybody Ought to Know About Big Data

In “Underhyped – Big Data as an Advance in the Scientific Method”, Yanpei Chen makes the argument that big data is a fundamental advancement to the scientific method. This is an exceedingly bold claim, and to be honest I suspected a strong dose of sensationalism. I mentioned this to a mutual acquaintance. I was informed that […]
