March 26, 2020

How to Use the Kudu Quickstart on Windows

By Donald Sawyer

This blog post was written by Donald Sawyer and Frank Rischner. 

Introduction to Apache Kudu

Apache Kudu is a distributed, highly available, columnar storage manager with the ability to quickly process data workloads that include inserts, updates, upserts, and deletes. Kudu integrates very well with Spark, Impala, and the Hadoop ecosystem. At phData, we use Kudu to achieve customer success for a multitude of use cases, including OLAP workloads, streaming use cases, machine learning, and data analysis/visualization.

If you want to develop applications that integrate with Apache Kudu and would like to build and run them against a local Kudu cluster, you can use the Kudu Quickstart. The Quickstart launches a three-node Kudu cluster in your local environment using Docker. It works very well in a Linux or Mac OS X environment, but the tutorial doesn't yet include instructions for running a cluster on a Windows machine.

In this blog, you will learn how to stand up a local Kudu cluster on Windows, both with the Windows Subsystem for Linux (WSL) as well as running natively on Windows with Powershell. Once the cluster is running, you’ll also see an example of connecting to the cluster using Apache Spark. For additional examples of developing in Kudu, visit the Kudu docs page on developing Kudu applications.

Why Use a Kudu Quickstart Cluster?

Developers use a quickstart environment to test their applications locally against a technology, in this case Kudu. Kudu is typically set up on a distributed platform like Cloudera Hadoop, and it can be a real hassle to do your development, deploy to a Hadoop cluster, and test it out, only to find that you need to make changes. By allowing developers to run a Kudu cluster on their own machines, the quickstart removes the need for all of that Hadoop infrastructure.

Docker is a container platform used to stand up infrastructure, and it is an ideal tool for spinning up lightweight containers to test or explore new technologies. Compared to resource-intensive virtual machines, containers are lightweight and start very quickly. They are also extensible and can be reused multiple times.

Standing up a containerized environment isn’t the end of the road. You need to integrate it into your environment to use it with, in the case of this blog, a Spark application. Fortunately, the developers working on Apache Kudu have made the integration to Docker fairly seamless.

How to Prepare Your Environment for Running Kudu

Prepare your environment by following the Kudu Quickstart tutorial, but note that the Set KUDU_QUICKSTART_IP section will not work properly on Windows. The subsections below give you the resources you need to get the Kudu cluster up and running.

There is also a companion git repository with tutorials, scripts, and example code for integrating Spark and Kudu with the Kudu Quickstart.

Prerequisites

There are a few prerequisites to using the Kudu Quickstart and Spark on Windows: working installations of Java, Scala, Apache Spark, and Docker.

Once you’ve verified that Java, Scala, Spark, and Docker are working, you’re ready to start!

How to Launch and Destroy a Kudu Cluster Using the WSL

Many big data developers are comfortable in a Linux shell environment, and the Windows Subsystem for Linux (WSL) provides a lightweight layer for executing Linux binaries. WSL, developed by Microsoft, is a Linux-compatible kernel interface on which the various Linux distributions available in the Windows Store can run. Since the Docker daemon cannot run inside WSL, you will need the Docker Desktop app to run the actual containers.

After installing and configuring docker-client for WSL as described in Appendix A: Installing Docker for Windows Subsystem for Linux (WSL), you can follow the steps described in the Kudu Quickstart tutorial.

git clone https://github.com/apache/kudu
cd kudu

To automate setting the IP address and starting the cluster, we recommend putting the commands below in a wrapper script.

export KUDU_QUICKSTART_IP=$(ifconfig \
  | grep "inet " | grep -Fv 127.0.0.1 |  awk '{print $2}' \
  | tail -1) 

docker-compose -f docker/quickstart.yml up

The Kudu Docker cluster is now up and running and can be used for testing applications. To shut down the cluster, run docker-compose -f docker/quickstart.yml stop from the kudu directory.

How to Launch and Destroy a Kudu Cluster Using Powershell

With Docker running, you can start following the Kudu Quickstart tutorial. As a matter of fact, you can follow the entire tutorial with one exception: setting the KUDU_QUICKSTART_IP environment variable.

Launch the Cluster

Follow the steps below to bring up the Kudu Quickstart cluster from Powershell.

  1. Open up a Powershell terminal (⊞ Win + r, then powershell.exe)
  2. Clone the Kudu repository
  3. Set the KUDU_QUICKSTART_IP environment variable
    $env:KUDU_QUICKSTART_IP=(Get-NetIPConfiguration | `
     Where-Object {$_.IPv4DefaultGateway -ne $null -and `
     $_.NetAdapter.Status -ne "Disconnected"}).IPv4Address.IPAddress
  4. Follow the remainder of the tutorial, starting at Bring up the Cluster

Destroy the Cluster

If you brought up the cluster without the -d flag in the docker-compose command, you can press Ctrl+C to shut down the cluster.

If you used the -d option, then follow one of the suggestions in Destroying the Cluster in the Kudu Quickstart tutorial.

How to Use Spark & Kudu in the Spark Shell

As a quick step to get up and running, you can test out your Spark integration using spark-shell. This section is similar to the tutorial provided in the Kudu Quickstart github repo.

This section of the tutorial will cover the following steps:

  1. Launch spark-shell with Kudu support
  2. Create some data in a Spark DataFrame
  3. Instantiate a KuduContext

Launch the Spark Shell with Kudu Support

The Kudu libraries will not be available by default. Include them as a part of the spark-shell launch command.

spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.11.1
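
One quick sanity check (our suggestion, not part of the official tutorial): if the shell starts and the import below succeeds, the Kudu package was resolved correctly.

// Brings in the Kudu-Spark integration; a failure here means the
// --packages download did not work.
import org.apache.kudu.spark.kudu._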

Set Up a DataFrame to Write to Kudu

First, you’ll need some data, so create a DataFrame. Though the small dataset below doesn’t require a distributed system, it provides data that can be written to Kudu. A larger dataset can be used once you are able to interact with Kudu successfully.

This portion of the tutorial will go through the following steps for accessing Kudu with Spark:

  1. Set up a DataFrame to be written to Kudu
  2. Instantiate the KuduContext
  3. Drop/Create a Kudu table
  4. Insert data into Kudu from a Spark DataFrame
  5. Read data from Kudu into a Spark DataFrame

Create the Schema for the Dataset

Notice that in the schema for the dataset, the first three fields are not nullable. This is because they will be used for the primary key in the Kudu table, and PK columns cannot be null.

import org.apache.spark.sql.types._

val gameSchema = StructType(List(
  StructField("release_year", IntegerType, false), // primary key column
  StructField("title", StringType, false),         // primary key column
  StructField("publishers", StringType, false),    // primary key column
  StructField("platforms", StringType, true)))     // nullable non-key column

Create Some Data to be Inserted

The data here is a curated subset from the Video Games Classification Database on data.gov. The year of release, romanised title, publishers, and platform fields were used. After submitting this code block, the data will be in a DataFrame called gameDf.

import org.apache.spark.sql.Row
val gameData = Seq(  
  Row(2017, "1-2-SWITCH", "NINTENDO", "Nintendo Switch"), 
  Row(2018, "7'SCARLET", "AKSYS GAMES", "Sony PS Vita"), 
  Row(2019, "8-BIT HORDES", "SOEDESCO", "Sony Playstation 4"), 
  Row(2017, "AEREA", "SOEDESCO", "Sony Playstation 4"), 
  Row(2018, "ARK PARK", "SNAIL GAMES", "Sony Playstation 4"), 
  Row(2017, "ARMS", "NINTENDO", "Nintendo Switch"), 
  Row(2017, "BAD APPLE WARS", "AKSYS", "Sony PS Vita"), 
  Row(2017, "CAVE STORY+", "SEGA/NICALIS", "Nintendo Switch"), 
  Row(2017, "COLLAR X MALICE", "AKSYS", "Sony PS Vita"), 
  Row(2018, "CONAN EXILES", "FUNCOM", "PC,Sony Playstation 4"), 
  Row(2018, "CONSTRUCTOR PLUS", "SYSTEM 3", "Nintendo Switch"), 
  Row(2017, "CULDCEPT REVOLT", "NIS AMERICA", "Nintendo 3DS"), 
  Row(2018, "DETECTIVE PIKACHU", "NINTENDO", "Nintendo 3DS"), 
  Row(2016, "DISNEY ART ACADEMY", "NINTENDO", "Nintendo 3DS"))

val gameDf = spark.createDataFrame(
  spark.sparkContext.parallelize(gameData), 
  gameSchema)

Set Up the KuduContext

At this point, you will set up the KuduContext, pointing to the Kudu masters that are running on your local machine.

import org.apache.kudu.spark.kudu.KuduContext
val kuduContext = new KuduContext(
  "localhost:7051,localhost:7151,localhost:7251",
  spark.sparkContext)

Create the Kudu Table

Using the previously instantiated KuduContext and the gameSchema defined earlier, a table will be created. Before creating it, check whether the table already exists and delete it if so, since this tutorial only does inserts and not upserts. The code below performs the deletion, followed by the table creation.

Additional documentation around upserts in Spark can be found in the Kudu documentation.

val gameKuduTableName = "games"

if(kuduContext.tableExists(gameKuduTableName)) {
  kuduContext.deleteTable(gameKuduTableName)
}

import scala.collection.JavaConverters._
import org.apache.kudu.client.CreateTableOptions
kuduContext.createTable(gameKuduTableName,
  gameSchema, // Kudu schema with PK columns set as Not Nullable
  Seq("release_year", "title", "publishers"), // Primary Key Columns
  new CreateTableOptions().
    setNumReplicas(3).
    addHashPartitions(List("release_year").asJava, 2))

Write the DataFrame to Kudu

At this point, the data is ready to be inserted using the Kudu API.

kuduContext.insertRows(gameDf, gameKuduTableName)
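
As noted above, Kudu also supports upserts. A minimal alternative sketch: if you replace the insert with the upsert call below, rows whose primary key already exists are updated instead of causing errors, so the drop-and-recreate step before table creation becomes unnecessary.

// Upsert instead of insert: existing primary keys are updated,
// new primary keys are inserted.
kuduContext.upsertRows(gameDf, gameKuduTableName)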

Read the Kudu Table into a DataFrame

Now, verify that the data has been written to Kudu. You can use the Kudu web UI to look at the tablet information (http://localhost:8050/tablets), but to use the data in Spark, read it from Kudu into a DataFrame. The example below, adapted from the Kudu Spark quickstarts, keeps only the rows for games released in 2017.

// We need to use leader_only because Kudu on Docker currently doesn't
// support Snapshot scans due to `--use_hybrid_clock=false`.
val gamesKuduDf = spark.read.
  option("kudu.master", 
    "localhost:7051,localhost:7151,localhost:7251").
  option("kudu.table", gameKuduTableName).
  option("kudu.scanLocality", "leader_only").
  format("kudu").
  load

gamesKuduDf.where($"release_year" === 2017).show
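
If you prefer SQL, you can also register the DataFrame as a temporary view and query it. This is standard Spark functionality, nothing Kudu-specific, and the view name below is our own choice.

// Register the Kudu-backed DataFrame as a temp view and query it with SQL.
gamesKuduDf.createOrReplaceTempView("games_view")
spark.sql("SELECT title, publishers FROM games_view WHERE release_year = 2017").show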

At this point, Kudu has been used to read and write data using Spark. This should be enough to get started playing with Spark/Kudu in the Spark shell on a local Kudu Quickstart setup.
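
One more option worth knowing: the KuduContext is not the only write path. The sketch below, using the same master addresses, writes through the standard DataFrame writer instead; note that kudu-spark treats the append save mode as an upsert operation.

// Write via the DataFrame API instead of the KuduContext.
gameDf.write.
  option("kudu.master",
    "localhost:7051,localhost:7151,localhost:7251").
  option("kudu.table", gameKuduTableName).
  mode("append"). // append is translated into Kudu upserts
  format("kudu").
  save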

How to Build a Spark Application for Apache Kudu

If you’d like to try using an IDE like IntelliJ to build a Spark application that will integrate with the Kudu Quickstart, try out the application on our github. The project is set up with Maven, so it can be imported into your IDE (we tested it with IntelliJ).

The github repo that contains the spark-shell and Spark application can be cloned from https://github.com/phdata/kudu-quickstart-windows-example
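
For orientation, the skeleton of such an application might look like the sketch below; the object name and the connectivity check are illustrative, not taken from the repo.

// A minimal standalone Spark application that connects to the quickstart
// cluster (assumes kudu-spark2_2.11 and spark-sql on the classpath,
// e.g. via Maven).
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession

object KuduQuickstartExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().
      appName("kudu-quickstart-example").
      master("local[*]"). // run Spark locally against the Docker cluster
      getOrCreate()

    val kuduContext = new KuduContext(
      "localhost:7051,localhost:7151,localhost:7251",
      spark.sparkContext)

    // Simple connectivity check against the table created in this post.
    println(s"games table exists: ${kuduContext.tableExists("games")}")

    spark.stop()
  }
}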

Appendix A: Installing Docker for Windows Subsystem for Linux (WSL)

Follow the steps below to install Docker in the Windows Subsystem for Linux. The steps below assume you are using the Ubuntu images for WSL and you have WSL enabled. On Docker for Windows, make sure that the option “Expose daemon on tcp://localhost:2375 without TLS” is enabled.

#upgrade ubuntu packages (optional)
sudo apt-get update -y
sudo apt-get upgrade

#install required packages and add docker repo

sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
sudo apt-get update -y

#install docker, python & pip
sudo apt-get install -y docker-ce
sudo usermod -aG docker $USER
sudo apt-get install -y python python-pip
pip install --user docker-compose

# ensure .local/bin is on the WSL PATH
echo $PATH | grep .local
# add export of the windows docker port to bashrc
echo "export DOCKER_HOST=tcp://localhost:2375" >> ~/.bashrc && source 
~/.bashrc

After you’ve installed docker-ce and started Docker on Windows, test the connectivity to Docker.

#test the docker installation 
docker info

You are now set to run any Docker containers and connect to them from WSL.

Conclusion

Apache Kudu is a great distributed data storage system, but you don’t necessarily want to stand up a full cluster to try it out. The Kudu Quickstart is a valuable tool for experimenting with Kudu on your local machine. With a few small tweaks you can use it on Windows, so you won’t be limited to Linux or Mac OS X, the only platforms currently mentioned in the Kudu Quickstart documentation.

If you’d like additional help building or integrating Kudu or other Hadoop and Spark systems to meet your analytical needs, reach out to phData via email at sales@phdata.io.
