Exploring Spark MLlib: Part 3 – Transformation and Model Creation

In the first two posts (1, 2) we ingested the data and explored it with the spark-shell. Now we'll move on to creating and submitting our code as a standalone Spark application. Again, all the code covered in these posts can be found here.

We'll start by creating a case class and a function for parsing the data into that class. This will help clarify the code in later operations.
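Here's a minimal sketch of what that might look like, assuming a comma-separated layout with price, bedrooms, bathrooms, and square feet columns (the names and field order are illustrative; adjust them to match the actual data):

```scala
// Hypothetical column layout: price, bedrooms, bathrooms, squareFeet.
case class HousePrice(price: Double, bedrooms: Double, bathrooms: Double, squareFeet: Double)

// Parse a single CSV line into a HousePrice.
def parseHousePrice(line: String): HousePrice = {
  val fields = line.split(",")
  HousePrice(fields(0).toDouble, fields(1).toDouble, fields(2).toDouble, fields(3).toDouble)
}
```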

Next we use the parse function to create an RDD of parsed home prices.
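Assuming the case class and parser sketched above, this could look roughly like the following (the file path is a placeholder):

```scala
// Load the raw CSV and map each line through the parser.
val rawData = sc.textFile("hdfs:///path/to/home-data.csv") // hypothetical path
val homePrices = rawData.map(parseHousePrice)
```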

As we saw in the previous posts, there was some outlier data. We'll use the RDD's filter function to remove it, limiting the data to home prices between $100k and $400k and houses over 1,000 square feet.
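Continuing with the sketch above, the filter might look something like this:

```scala
// Keep homes priced between $100k and $400k with more than 1,000 square feet.
val filtered = homePrices.filter { h =>
  h.price >= 100000 && h.price <= 400000 && h.squareFeet > 1000
}
```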

The model we're going to use is a simple linear regression. Its output is a set of weights and an intercept. If you remember back to your algebra days, the slope-intercept form of a line is y = ax + b. This is exactly the kind of formula the linear regression model optimizes, but instead of using just one variable (for instance, square feet to determine price), we'll introduce more features, like bathrooms, bedrooms, etc., and therefore more coefficients or weights (y = a1x1 + a2x2 + … + b).

The input for the linear regression model is an RDD of LabeledPoints. The label in this case is the price of the home, and the features (bedrooms, baths, square feet, etc.) are stored as local Vectors. Given the data we have, we use the dense form of the Vector. The model also trains better if the data has been scaled to mean zero and unit variance. The latest version of MLlib has a flag to do this as part of training, and there is a good comment explaining why; MLlib also has helper methods to accomplish it. Here is a look at the code:
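A sketch of that step, using MLlib's LabeledPoint, Vectors.dense, and StandardScaler, built on the hypothetical HousePrice fields from earlier:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.feature.StandardScaler

// The label is the price; the remaining columns become a dense feature vector.
val labeledData = filtered.map { h =>
  LabeledPoint(h.price, Vectors.dense(h.bedrooms, h.bathrooms, h.squareFeet))
}

// Scale the features to mean zero and unit variance before training.
val scaler = new StandardScaler(withMean = true, withStd = true)
  .fit(labeledData.map(_.features))
val scaledData = labeledData
  .map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
  .cache()
```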

Now we're finally at the fun part: training the model. The model has two hyperparameters, the number of iterations and the step size. The number of iterations tells the algorithm how many passes to make while adjusting the coefficients toward an optimal model, and the step size determines how much to adjust them on each iteration.
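A sketch of the training call, continuing from the scaled data above (the hyperparameter values are illustrative starting points, not tuned):

```scala
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Illustrative hyperparameters; tune these for your data.
val numIterations = 100
val stepSize = 0.1

val model = LinearRegressionWithSGD.train(scaledData, numIterations, stepSize)

println(s"Weights: ${model.weights}, Intercept: ${model.intercept}")
```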

To train the model we’ll build the source and submit the job.
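Roughly, that looks like the following (assuming an sbt project; the class name, jar name, and master setting are placeholders):

```bash
# Package the application with sbt.
sbt package

# Submit the resulting jar to Spark; class and jar names are hypothetical.
spark-submit \
  --class com.example.HousePriceModel \
  --master local[4] \
  target/scala-2.10/house-price-model_2.10-1.0.jar
```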

In summary, this post finally got down to the nuts and bolts of training a model. It walked through simple scaling transformations and filtering, then described a bit about the MLlib model we're using to predict home prices. In the next post, we'll look at the challenges of exporting the model and how offline, non-Spark jobs can utilize it.