Getting Started with Apache Kudu

Five years ago, enabling Data Science and Advanced Analytics on the Hadoop platform was hard. Organizations needed strong software engineering capabilities to implement complex Lambda architectures, or even just continuous ingest. Updating or deleting data was a nightmare. Complying with the General Data Protection Regulation (GDPR) would have been an extreme challenge at that time.

Apache Kudu’s Initial Commit

In that context, on October 11th, 2012, Todd Lipcon performed Apache Kudu’s initial commit. The commit message was:

Code for writing cfiles seems to basically work.
Need to write code for reading cfiles, still.

And Kudu development was off and running. Around this same time, Todd started listing on his internal Wiki page the papers he was reading to develop the theoretical background for creating Kudu. I followed along, reading as many as I could and understanding little, because I knew Todd was up to something important. About a year after that initial commit, I got my first Kudu commit, documenting the upper bound a library. It is a small contribution, but one I am still proud of.

In the meantime, I was lucky enough to be a founder of a Hadoop Managed Services and Consulting company known as phData. We found that a majority of our customers had use cases that Kudu vastly simplified. From Change Data Capture (CDC) across thousands of source tables to Internet of Things (IoT) ingest, Kudu makes life much easier both for operators of a Hadoop cluster and for developers providing business value on the platform.

Getting Started with Kudu

Through this work, I was lucky enough to be a co-author of Getting Started with Kudu. The book sums up what my co-authors, Jean-Marc Spaggiari, Mladen Kovacevic, and Ryan Bosshart, and I learned while cutting our teeth on early versions of Kudu. Specifically, you will learn:

A theoretical understanding of Kudu concepts, explained in plain-spoken words and simple diagrams
Why, for many use cases, using Kudu is so much easier than other ecosystem storage technologies
How Kudu enables Hybrid Transactional/Analytical Processing (HTAP) use cases
How to design IoT, Predictive Modeling, and Mixed Platform Solutions using Kudu
How to design Kudu Schemas

Looking forward, I am excited to see Kudu gain additional features and adoption, and eventually to see a second revision of this title. In the meantime, if you have feedback or questions, please reach out on the #getting-started-kudu channel of the Kudu Slack or, if you prefer non-real-time communication, use the user@ mailing list!
