This blog post was written by Donald Sawyer and Frank Rischner.
Introduction to Apache Kudu
Apache Kudu is a distributed, highly available, columnar storage manager with the ability to quickly process data workloads that include inserts, updates, upserts, and deletes. Kudu integrates very well with Spark, Impala, and the Hadoop ecosystem. At phData, we use Kudu to achieve customer success for a multitude of use cases, including OLAP workloads, streaming use cases, machine learning, and data analysis/visualization.
If you want to develop applications that integrate with Apache Kudu, but would like to build your apps and run them locally on a Kudu cluster, you can use the Kudu Quickstart. The Quickstart will launch a three-node Kudu cluster on your local environment using Docker. The quickstart works very well in a Linux or Mac OS X environment, but the tutorial doesn’t yet include instructions for running a cluster on a Windows machine.
In this blog, you will learn how to stand up a local Kudu cluster on Windows, both with the Windows Subsystem for Linux (WSL) as well as running natively on Windows with Powershell. Once the cluster is running, you’ll also see an example of connecting to the cluster using Apache Spark. For additional examples of developing in Kudu, visit the Kudu docs page on developing Kudu applications.
Why Use a Kudu Quickstart Cluster?
Developers use a quickstart environment to test out their applications in a local environment — in this case, Kudu. Kudu is typically set up on a distributed platform like Cloudera Hadoop. It can be a real hassle to do your development, deploy to a Hadoop cluster, and then test it out, only to find out you need to make changes. By allowing the developer to run a Kudu cluster on their development machine, the quickstart solves the challenge of requiring all the infrastructure of Hadoop.
Docker is a container platform used to stand up infrastructure. It is an ideal tool to spin up lightweight containers used for testing or exploring new technologies. Compared to resource-intensive virtual machines, it is lightweight and start times are very quick. The containers are extensible and can be used multiple times.
Standing up a containerized environment isn’t the end of the road. You need to integrate it into your environment to use it with, in the case of this blog, a Spark application. Fortunately, the developers working on Apache Kudu have made the integration to Docker fairly seamless.
How to Prepare Your Environment for Running Kudu
Prepare your environment by following the Kudu Quickstart tutorial. Note, however, that the Set KUDU_QUICKSTART_IP section will not work properly on Windows. The subsections below will give you the resources you need to get the Kudu cluster up and running.
There is also a companion git repository with tutorials, scripts, and example code for integrating Spark and Kudu with the Kudu Quickstart.
Prerequisites
There are a few prerequisites to using the Kudu Quickstart and Spark on Windows.
- Install Docker for Windows
- Set up Java, Scala, and Spark (all covered in this blog)
Once you’ve verified that Java, Scala, Spark, and Docker are working, you’re ready to start!
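A quick way to run that verification from a terminal is shown below (exact versions will vary; the examples in this blog assume a Spark 2.x build with Scala 2.11):

java -version
scala -version
spark-shell --version
docker version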
How to Launch and Destroy a Kudu Cluster Using the WSL
Many big data developers are comfortable in a Linux shell environment, and the Windows Subsystem for Linux (WSL) provides a lightweight layer for executing Linux binaries on Windows. WSL is a Linux-compatible kernel interface developed by Microsoft, and the various Linux distributions that run on it can be downloaded from the Microsoft Store. Since the Docker daemon cannot run inside WSL, you will need the Docker Desktop app to run the actual containers.
After installing and configuring docker-client for WSL as described in Appendix A: Installing Docker for Windows Subsystem for Linux (WSL), you can follow the steps described in the Kudu Quickstart tutorial.
git clone https://github.com/apache/kudu
cd kudu
To automate setting the IP and starting the cluster, we recommend putting the commands below in a wrapper script.
export KUDU_QUICKSTART_IP=$(ifconfig \
  | grep "inet " | grep -Fv 127.0.0.1 | awk '{print $2}' \
  | tail -1)
docker-compose -f docker/quickstart.yml up
The Kudu Docker cluster is now up and running and can be used for testing applications. To shut down the cluster, use the docker-compose stop command.
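For example, from the kudu directory (assuming the same quickstart.yml used above):

docker-compose -f docker/quickstart.yml stop

This stops the containers while keeping their state; replacing stop with down removes the containers entirely.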
How to Launch and Destroy a Kudu Cluster Using Powershell
With Docker running, you can start following the Kudu Quickstart tutorial. As a matter of fact, you can follow the entire tutorial with one exception: setting the KUDU_QUICKSTART_IP environment variable.
Launch the Cluster
Follow the steps below to bring up the Kudu Quickstart cluster from Powershell.
- Open up a Powershell terminal (⊞ Win + r, then powershell.exe)
- Clone the Kudu repository (the commands are shown after this list)
- Set the KUDU_QUICKSTART_IP environment variable
$env:KUDU_QUICKSTART_IP=(Get-NetIPConfiguration | `
  Where-Object {$_.IPv4DefaultGateway -ne $null -and `
  $_.NetAdapter.Status -ne "Disconnected"}).IPv4Address.IPAddress
- Follow the remainder of the tutorial, starting at Bring up the Cluster
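For reference, the clone step (step 2 above) uses the same commands as the WSL section, run from wherever you keep your source code:

git clone https://github.com/apache/kudu
cd kudu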
Destroy the Cluster
If you brought up the cluster without the -d flag in the docker-compose command, you can press Ctrl+C to shut down the cluster.
If you used the -d option, then follow one of the suggestions in Destroying the Cluster in the Kudu Quickstart tutorial.
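In the simplest case, that means running the following from the kudu directory (assuming the same quickstart.yml used to bring the cluster up):

docker-compose -f docker/quickstart.yml down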
How to Use Spark & Kudu in the Spark Shell
As a quick step to get up and running, you can test out your Spark integration using spark-shell. This section is similar to the tutorial provided in the Kudu Quickstart GitHub repo.
This section of the tutorial will cover the following steps:
- Launch spark-shell with Kudu support
- Create some data in a Spark DataFrame
- Instantiate a KuduContext
Launch the Spark Shell with Kudu Support
The Kudu libraries are not available to Spark by default, so include the kudu-spark package as part of the spark-shell launch command.
spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.11.1
Set Up a DataFrame to Write to Kudu
First, you’ll need some data. Create a DataFrame with a handful of rows. Though the dataset below is far too small to need a distributed system, it provides data that can be written to Kudu. A larger dataset can be swapped in once you are able to interact with Kudu successfully.
This portion of the tutorial will go through the following steps for accessing Kudu with Spark:
- Set up a DataFrame to be written to Kudu
- Instantiate the KuduContext
- Drop/Create a Kudu table
- Insert data into Kudu from a Spark DataFrame
- Read data from Kudu into a Spark DataFrame
Create the Schema for the Dataset
Notice that in the schema for the dataset, the first three fields are not nullable. This is because they will be used for the primary key in the Kudu table, and PK columns cannot be null.
import org.apache.spark.sql.types._

val gameSchema = StructType(List(
  StructField("release_year", IntegerType, false),
  StructField("title", StringType, false),
  StructField("publishers", StringType, false),
  StructField("platforms", StringType, true)))
Create Some Data to be Inserted
The data here is a curated subset from the Video Games Classification Database on data.gov. The year of release, romanised title, publishers, and platform fields were used. After submitting this code block, the data will be in a DataFrame called gameDf.
import org.apache.spark.sql.Row

val gameData = Seq(
  Row(2017, "1-2-SWITCH", "NINTENDO", "Nintendo Switch"),
  Row(2018, "7'SCARLET", "AKSYS GAMES", "Sony PS Vita"),
  Row(2019, "8-BIT HORDES", "SOEDESCO", "Sony Playstation 4"),
  Row(2017, "AEREA", "SOEDESCO", "Sony Playstation 4"),
  Row(2018, "ARK PARK", "SNAIL GAMES", "Sony Playstation 4"),
  Row(2017, "ARMS", "NINTENDO", "Nintendo Switch"),
  Row(2017, "BAD APPLE WARS", "AKSYS", "Sony PS Vita"),
  Row(2017, "CAVE STORY+", "SEGA/NICALIS", "Nintendo Switch"),
  Row(2017, "COLLAR X MALICE", "AKSYS", "Sony PS Vita"),
  Row(2018, "CONAN EXILES", "FUNCOM", "PC,Sony Playstation 4"),
  Row(2018, "CONSTRUCTOR PLUS", "SYSTEM 3", "Nintendo Switch"),
  Row(2017, "CULDCEPT REVOLT", "NIS AMERICA", "Nintendo 3DS"),
  Row(2018, "DETECTIVE PIKACHU", "NINTENDO", "Nintendo 3DS"),
  Row(2016, "DISNEY ART ACADEMY", "NINTENDO", "Nintendo 3DS"))

val gameDf = spark.createDataFrame(
  spark.sparkContext.parallelize(gameData), gameSchema)
Set Up the KuduContext
At this point, you will set up the KuduContext, pointing to the Kudu masters that are running on your local machine.
import org.apache.kudu.spark.kudu.KuduContext

val kuduContext = new KuduContext(
  "localhost:7051,localhost:7151,localhost:7251",
  spark.sparkContext)
Create the Kudu Table
Using the previously instantiated KuduContext, a table will be created from gameDf. Before creating the table, check whether it already exists and delete it if so, since the tutorial only does inserts and not upserts. The code below performs the deletion, followed by the table creation.
Additional documentation around upserts in Spark can be found in the Kudu documentation.
val gameKuduTableName = "games"

if (kuduContext.tableExists(gameKuduTableName)) {
  kuduContext.deleteTable(gameKuduTableName)
}

import scala.collection.JavaConverters._
import org.apache.kudu.client.CreateTableOptions

kuduContext.createTable(
  gameKuduTableName,
  gameSchema,                                 // Kudu schema with PK columns set as not nullable
  Seq("release_year", "title", "publishers"), // primary key columns
  new CreateTableOptions().
    setNumReplicas(3).
    addHashPartitions(List("release_year").asJava, 2))
Write the DataFrame to Kudu
At this point, the data is ready to be inserted using the Kudu API.
kuduContext.insertRows(gameDf, gameKuduTableName)
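If you want this step to be safely re-runnable without dropping the table first, the KuduContext also provides an upsertRows method (the upserts mentioned earlier), which updates existing rows rather than reporting duplicate-key errors:

kuduContext.upsertRows(gameDf, gameKuduTableName)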
Read the Kudu Table into a DataFrame
Now, verify that the data has been written to Kudu. You can use the Kudu web UI to look at the tablet information (http://localhost:8050/tablets), but to use the data in Spark, read it from Kudu into a DataFrame. The example below, adapted from the Kudu Spark quickstarts, filters for games released in 2017.
// We need to use leader_only because Kudu on Docker currently doesn't
// support Snapshot scans due to `--use_hybrid_clock=false`.
val gamesKuduDf = spark.read.
  option("kudu.master", "localhost:7051,localhost:7151,localhost:7251").
  option("kudu.table", gameKuduTableName).
  option("kudu.scanLocality", "leader_only").
  format("kudu").
  load

gamesKuduDf.where($"release_year" === 2017).show
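As an optional extra beyond the quickstart, the same DataFrame can be registered as a temporary view and queried with Spark SQL:

gamesKuduDf.createOrReplaceTempView("games")
spark.sql("SELECT title, publishers FROM games WHERE release_year = 2017").show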
At this point, Kudu has been used to read and write data using Spark. This should be enough to get started playing with Spark/Kudu in the Spark shell on a local Kudu Quickstart setup.
How to Build a Spark Application for Apache Kudu
If you’d like to try using an IDE like IntelliJ to build a Spark application that will integrate with the Kudu Quickstart, try out the application on our github. The project is set up with Maven, so it can be imported into your IDE (we tested it with IntelliJ).
The GitHub repo that contains the spark-shell examples and the Spark application can be cloned from https://github.com/phdata/kudu-quickstart-windows-example
Appendix A: Installing Docker for Windows Subsystem for Linux (WSL)
Follow the steps below to install Docker in the Windows Subsystem for Linux. The steps below assume you are using the Ubuntu images for WSL and you have WSL enabled. On Docker for Windows, make sure that the option “Expose daemon on tcp://localhost:2375 without TLS” is enabled.
# upgrade ubuntu packages (optional)
sudo apt-get update -y
sudo apt-get upgrade

# install required packages and add docker repo
sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) \
  stable"
sudo apt-get update -y

# install docker, python & pip
sudo apt-get install -y docker-ce
sudo usermod -aG docker $USER
sudo apt-get install -y python python-pip
pip install --user docker-compose

# ensure .local/bin is on the WSL PATH
echo $PATH | grep .local

# add export of the windows docker port to bashrc
echo "export DOCKER_HOST=tcp://localhost:2375" >> ~/.bashrc && source ~/.bashrc
After you’ve installed docker-ce and started docker on Windows, test the connectivity to docker.
# test the docker installation
docker info
You are now set to run any Docker containers and connect to them from WSL.
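If you want one final end-to-end check, running a throwaway container is a simple way to get it:

docker run --rm hello-world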
Conclusion
Apache Kudu is a great distributed data storage system, but you don’t necessarily want to stand up a full cluster to try it out. The Kudu Quickstart is a valuable tool to experiment with Kudu on your local machine. With a few small tweaks you can use it on Windows, and you won’t be limited to Linux or Mac OS X, which are the only platforms currently covered by the Kudu Quickstart documentation.
If you’d like additional help building or integrating Kudu or other Hadoop and Spark systems to meet your analytical needs, reach out to phData via email at sales@phdata.io.