Exploring Spark MLlib: Part 2 – Exploring the data

In the last post we got the environment set up. Now that the data is in the cluster and Spark is running, we can begin to explore the data. A common way to start is with the spark-shell, a powerful command-line interpreter for the Spark environment. Let's get started.

Execute the spark-shell command. Since we are on a resource-constrained sandbox VM, we'll use minimal resources.
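A launch on a small sandbox might look something like the following; the flags and memory values are only illustrative, so adjust them to whatever your VM can spare.

```
spark-shell --master yarn --driver-memory 512m --executor-memory 512m --num-executors 1
```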

The spark-shell drops us into a CLI from which we can explore the data. First, we should determine the structure of the file.
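A quick way to do that is to load the raw file and print a few records. This is a minimal sketch; the HDFS path below is an assumption, so substitute the location where you landed the data in the previous post.

```scala
// sc is the SparkContext that spark-shell provides automatically.
// The path is illustrative -- point it at wherever the listing file was loaded.
val rawData = sc.textFile("hdfs:///user/spark/listings/listings.txt")

// Peek at the first few records to see the delimiter and field layout.
rawData.take(5).foreach(println)
```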

Here we can see the file is pipe (|) delimited and the fields are: MLS #, City, Square Feet, Bedrooms, Baths, Garage Stalls, Age, Lot Size, Price. Let's use MLlib's statistical utilities to explore some of the attributes of the data further.
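One option is Statistics.colStats, which computes column-wise summary statistics over an RDD of vectors. The field indices below are assumptions based on the layout described above, with the non-numeric MLS # and City columns skipped.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Parse the numeric fields of each pipe-delimited record into a dense vector:
// Square Feet, Bedrooms, Baths, Garage Stalls, Age, Lot Size, Price.
val numericData = rawData.map { line =>
  val fields = line.split('|')
  Vectors.dense(fields.slice(2, 9).map(_.trim.toDouble))
}

// Column-wise summary statistics: mean, variance, min, max, and counts.
val summary = Statistics.colStats(numericData)
println(summary.mean)
println(summary.min)
println(summary.max)
```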

Now we'll take a look at how correlated the data is, or put another way, how much the price of a house depends on its square footage.
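Statistics.corr can compute a Pearson correlation between two RDDs of doubles. Again, the column positions are assumptions based on the field layout above.

```scala
import org.apache.spark.mllib.stat.Statistics

// Pull out square feet and price as separate RDD[Double] columns.
val squareFeet = rawData.map(_.split('|')(2).trim.toDouble)
val price      = rawData.map(_.split('|')(8).trim.toDouble)

// Pearson correlation between square feet and price.
val correlation = Statistics.corr(squareFeet, price, "pearson")
println(s"Correlation between square feet and price: $correlation")
```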

There is a strong correlation, meaning square footage is a major factor in the price of a house, which is intuitive.

In this post we explored the home listing data set. We figured out its structure, looked at basic statistics of the data, and dabbled in some correlation metrics.

The shell is a powerful tool for iterative exploration, but it doesn't lend itself to very modular code. We also noticed some of the data didn't look right, as there were listings for $1. In the next post, we'll start creating more modular and readable code along with simple transformations and filtering of the data.