August 6, 2018

Getting Started with Kudu

By Brock Noland

Five years ago, enabling Data Science and Advanced Analytics on the Hadoop platform was hard. Organizations needed strong Software Engineering capabilities to implement complex Lambda architectures, or even just continuous ingest. Updating or deleting data was a nightmare. Complying with the General Data Protection Regulation (GDPR) would have been an extreme challenge at that time.

Apache Kudu’s Initial Commit

In that context, on October 11th, 2012, Todd Lipcon made Apache Kudu’s initial commit. The commit message was:

  • Code for writing cfiles seems to basically work.
  • Need to write code for reading cfiles, still.

And Kudu development was off and running. Around the same time, Todd started listing, on his internal Wiki page, the papers he was reading to build the theoretical background for creating Kudu. I followed along, reading as many as I could and understanding little, because I knew Todd was up to something important. About a year after that initial commit, I made my first Kudu commit, documenting the upper bound of a library. It is a small contribution of which I am still proud.

In the meantime, I was lucky enough to be a founder of a Hadoop Managed Services and Consulting company known as phData. We found that a majority of our customers had use cases that Kudu vastly simplified. Whether the use case is Change Data Capture (CDC) from thousands of source tables or Internet of Things (IoT) ingest, Kudu makes life much easier for both the operator of a Hadoop cluster and the developer providing business value on the platform.

Getting Started with Kudu

Through this work, I was lucky enough to be a co-author of Getting Started with Kudu. The book sums up what my co-authors, Jean-Marc Spaggiari, Mladen Kovacevic, and Ryan Bosshart, and I learned while cutting our teeth on early versions of Kudu. Specifically, you will learn:

  1. The theory behind Kudu concepts, explained in plain-spoken words and simple diagrams
  2. Why, for many use cases, using Kudu is so much easier than other ecosystem storage technologies
  3. How Kudu enables Hybrid Transactional/Analytical Processing (HTAP) use cases
  4. How to design IoT, Predictive Modeling, and Mixed Platform Solutions using Kudu
  5. How to design Kudu Schemas

Looking forward, I am excited to see Kudu gain additional features and adoption, and eventually to see the second revision of this title. In the meantime, if you have feedback or questions, please reach out on the #getting-started-kudu channel of the Kudu Slack or, if you prefer non-real-time communication, use the user@ mailing list!
