Exploring Spark MLlib: Part 1 – Setup and Ingest

This four-part series introduces Spark MLlib by walking through a basic example, much like a chapter in Advanced Spark Analytics (which phdata highly recommends). The goal is to cover an MLlib workflow end to end. The posts assume a basic understanding of Spark and the Scala programming language. Wherever possible, the code and examples include comments to help explain them.
First, an introduction. As most people know, the ML in Spark’s MLlib stands for machine learning. The library aims to provide a set of machine learning tools optimized for the Spark platform. It makes a field as deep and complex as machine learning approachable for mere mortals.
A common workflow for machine learning, reinforced in the book Advanced Spark Analytics, is to ingest, explore, transform, train, predict, and evaluate, then rinse and repeat. The same structure will guide the posts in this series.

Environment Setup

In this example, the data is a snapshot of real home listings that were for sale in the south metro of the Twin Cities. The ultimate goal is to use the data to train a basic MLlib algorithm that predicts the price of a new potential listing in the area. The code and data for the example are hosted on GitHub at https://github.com/phdata/exploring-mllib-post. We’ll clone the repository onto the latest version of the HDP sandbox VM.
First, we need to get the HDP 2.2 sandbox and the technical preview of Spark on the platform. Be sure to walk through both links below in preparation.

Ingesting the data

For the purposes of this post, we’ll assume the data was obtained from a third-party ingestion service and that a new set of listings would arrive at a set interval, possibly daily.
Here we clone the repo:
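If listings really did arrive on a schedule, each snapshot would need its own place in HDFS so that one day's file does not overwrite another's. Here is a rough sketch of that idea. The function names and the /user/root/listings/<date> layout are hypothetical and not part of the repository; the rest of this post simply copies a single file into /user/root/.

```shell
# Hypothetical helper: build a dated HDFS directory for one snapshot.
# The <base>/<YYYY-MM-DD> layout is an assumption made for illustration.
dated_target() {
  base_dir="${1%/}"   # drop a trailing slash, if any
  day="$2"
  printf '%s/%s\n' "$base_dir" "$day"
}

# Hypothetical helper: copy one listings file into today's directory.
# Requires the hdfs CLI, so it only works on a machine such as the sandbox.
ingest_listings() {
  target="$(dated_target /user/root/listings "$(date +%Y-%m-%d)")"
  hdfs dfs -mkdir -p "$target" && hdfs dfs -put -f "$1" "$target/"
}
```

Calling ingest_listings with a listings file from a daily scheduled job would leave one directory per day under /user/root/listings.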

[root@sandbox ~]# git clone https://github.com/phdata/exploring-mllib-post
[root@sandbox ~]# cd exploring-mllib-post

And copy the data to HDFS:

[root@sandbox exploring-mllib-post]# hdfs dfs -copyFromLocal homeprice.data /user/root/
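Before moving on, it can be worth confirming that the file actually landed where we expect. The sketch below lists the file and prints its first few lines. It assumes the hdfs command is on the PATH, as it is on the sandbox, and uses the same /user/root/homeprice.data path as the copy above; the hdfs_path variable is just a name chosen for this example.

```shell
# Check that the file landed in HDFS and peek at the first few records.
# Assumes the hdfs CLI is on the PATH, as on the HDP sandbox.
hdfs_path=/user/root/homeprice.data

if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -ls "$hdfs_path"                 # shows size and owner if the file exists
  hdfs dfs -cat "$hdfs_path" | head -n 5    # first five lines of the data
else
  echo "hdfs not found; run this on the sandbox" >&2
fi
```

If the listing comes back empty or reports that the path does not exist, rerun the copy step from inside the cloned repository directory.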

Now we are set to explore the data using spark-shell, which we will cover in the next post.
This post provided a quick introduction to MLlib and laid the foundation for future posts. We set up the sandbox VM with the necessary libraries, pulled the code, and ingested the data into HDFS.