Exploring Spark MLlib: Part 4 – Exporting the model for use outside of Spark

In the previous posts (1, 2, 3) we explored Spark MLlib's toolbag for training a linear regression model, using for-sale home listings as the training set. Now it would be useful to take that trained model and expose it as a real-time lookup service for people wondering what price they should put their house on the market for.

In the latest release of Spark MLlib, 1.3, there have been a number of updates to make the models' constructors public and to add a common save facility, as there is a clear need to use the models outside of Spark. There have been discussions about conforming to a standard for saving ML models and whether they could be made transferable outside of MLlib. One option is the PMML definition; however, the licensing around its implementations has caused some heartburn. It is clear that standards for ML model export will be a space to watch: ML algorithms are still evolving, and the definition of a model can vary from one implementation and platform to the next.

The version of MLlib we're working with doesn't include those features, but a linear regression model like the one in our example is very straightforward, so we'll walk through exporting it ourselves.

First, the trained model needs to be exported from our Spark job. Here's a look at the code.
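
A minimal sketch of what that export can look like (assuming sc is the job's SparkContext, and model and scalerModel are the trained LinearRegressionModel and StandardScalerModel from the earlier posts; your variable names may differ):

// Write each trained model to HDFS as a single-element object file
// (Java serialization under the hood). The paths match the
// copyToLocal commands below.
sc.parallelize(Seq(model), 1).saveAsObjectFile("/user/root/linReg.model")
sc.parallelize(Seq(scalerModel), 1).saveAsObjectFile("/user/root/scaler.model")
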
For the linear regression model, there are two attributes that make up the model: the intercept and the weights. We could have saved just those values, but instead we serialized the whole object. Similarly, since we scaled all the features when we trained the model, we also have to export the scaler model for use in our "online" recommender.
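
To see why those two attributes are all the model needs, here is a hypothetical check (model and a scaled feature vector scaledFeatures are assumed to be in scope): a prediction is just the dot product of the weights with the features, plus the intercept.

// Computes the same value model.predict(scaledFeatures) returns:
val manual = model.weights.toArray
  .zip(scaledFeatures.toArray)
  .map { case (w, x) => w * x }
  .sum + model.intercept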

Now let's take a look at using the models in a context outside of Spark. First we copy them out of HDFS:

[root@sandbox exploring-mllib-post]# hdfs dfs -copyToLocal /user/root/linReg.model .
[root@sandbox exploring-mllib-post]# hdfs dfs -copyToLocal /user/root/scaler.model .

Here is what our online use of the models looks like. We run a Spark context locally and use it to deserialize the models, then use them to predict the price of an 11-year-old, 2-bed, 2-bath, 1-garage house with 2,220 square feet.
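
A sketch of what that recommender could look like (the object name Recommender, the local master setting, and the feature order of age, beds, baths, garage, square feet are assumptions here; the feature order must match whatever was used in training):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.StandardScalerModel
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel

object Recommender {
  def main(args: Array[String]): Unit = {
    // Run Spark locally; we only need it to deserialize the models.
    val sc = new SparkContext(
      new SparkConf().setAppName("recommender").setMaster("local[1]"))

    // Each model was saved as a single-element object file.
    val model  = sc.objectFile[LinearRegressionModel]("linReg.model").first()
    val scaler = sc.objectFile[StandardScalerModel]("scaler.model").first()

    // 11 years old, 2 beds, 2 baths, 1 garage, 2220 square feet.
    // Scale with the same scaler used at training time, then predict.
    val features = Vectors.dense(11, 2, 2, 1, 2220)
    println(model.predict(scaler.transform(features)))

    sc.stop()
  }
}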

Now we run the recommender:

[root@sandbox exploring-mllib-post]# mvn clean compile -e exec:java
…
209560.89433285908

Great! $209k would be a starting point for that listing.

There you have it, an exploration of Spark’s MLlib. We ingested, explored, transformed, trained, evaluated and exported a machine learning model.