March 6, 2023

Building a Predictive Model in KNIME

By John Emery

If you spend even a few minutes on KNIME’s website or browsing through their whitepapers and blog posts, you’ll notice a common theme: a strong emphasis on data science and predictive modeling. Delving further into KNIME Analytics Platform’s Node Repository reveals a treasure trove of data science-focused nodes, from linear regression to k-means clustering to ARIMA modeling—and quite a bit in between.

The great thing about building a predictive model in KNIME is its simplicity. There is no need to be a Python programmer or to have an advanced degree in mathematics or computer science (although these things certainly don’t hurt). If you can connect a few nodes together and understand the various configuration settings of your desired model, you can do it in KNIME.

In this blog post, we will visit a few types of predictive models that are available in either the base KNIME installation or via a free extension.

Building a Linear Regression Model in KNIME

As anyone who has ever taken an elementary statistics course can attest, linear regression is the first and, perhaps, most important predictive model that one can learn. As a quick refresher, what is linear regression?

Concisely, linear regression models the relationship between a dependent variable and one or more independent variables. The mathematical formulation of a linear regression model results in a linear equation in which each independent variable is multiplied by a coefficient and the results are added together, along with an intercept, to provide a prediction for the dependent variable.
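As a rough illustration of that linear equation, the Python sketch below computes a prediction as an intercept plus a weighted sum of the independent variables. The function name and the numbers are made up for illustration; they do not come from KNIME or from any dataset in this post.

```python
def predict(intercept, coefficients, features):
    """Return intercept + sum(coefficient * feature) for one record."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Hypothetical model with two independent variables: y = 1.0 + 2.0*x1 - 0.5*x2
example = predict(1.0, [2.0, -0.5], [3.0, 4.0])
print(example)  # 1.0 + 6.0 - 2.0 = 5.0
```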

By using KNIME, you don’t need to worry about mathematical formulas or the theoretical underpinnings—so long as you understand when and why you should use a linear regression. We encourage you to study this topic if you intend to build predictive models. 

Linear Regression Nodes

Linear Regression Nodes.png

To begin, open KNIME Analytics Platform and navigate to Analytics → Mining → Linear/Polynomial Regression within the Node Repository. Inside that folder, you will find three nodes, of which we’ll focus on two: Linear Regression Learner and Linear Regression Predictor.

In general within KNIME, the Learner nodes take an existing dataset and build a predictive model based on the given data. The Predictor nodes then take the learned model and apply it to a dataset that was not used to build the model. Let’s look at an example.

Atlantic Hurricanes

There is a well-known inverse relationship between the sustained wind speed of a hurricane and its barometric pressure: as the pressure falls, the winds tend to get stronger. To study this relationship, we can build a linear regression model in KNIME using a dataset we downloaded from NOAA.

To build this model, the first step is to create training (for the learner node) and testing (for the predictor node) sets. To do this, we can configure the Partitioning node, which generates the two outputs. From here, connect the training set to the Linear Regression Learner node. For our model, we will use a storm’s minimum pressure to predict its maximum sustained winds. The node’s configuration may look like the image below.

Linear Regression Learner Configuration.png

The learner node has two output ports: a blue square that holds the model information and a black triangle through which users can view variable coefficients.

In the next step we will connect the model output of the learner node to the Linear Regression Predictor node along with the testing output from the Partitioning node. The nice thing about the predictor node is that it has no required configuration settings. You can connect and run it right away. 

With this model, there is a very strong relationship between the two variables. As you scroll through the results of the model, you will see that the predicted maximum wind speeds are very close to the actual values within the test set.

Linear Regression Output.png
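For readers who like to see the moving parts, the partition, learn, and predict pattern described above can be sketched in plain Python. This is only an analogy to the three KNIME nodes, not their implementation. The function names, the 80/20 split, and the (pressure, wind) pairs are all assumptions for illustration; the data is synthetic, generated from an invented line plus noise, and is not the NOAA dataset.

```python
import random


def partition(rows, train_fraction, seed):
    """Shuffle the rows and split them into (training, testing) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def learn(training):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(training)
    mean_x = sum(x for x, _ in training) / n
    mean_y = sum(y for _, y in training) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in training)
    sxx = sum((x - mean_x) ** 2 for x, _ in training)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def apply_model(model, testing):
    """Predict y for every test row using the learned intercept and slope."""
    intercept, slope = model
    return [intercept + slope * x for x, _ in testing]


# Synthetic (pressure, wind) pairs from an invented line plus random noise.
rng = random.Random(0)
rows = []
for _ in range(200):
    pressure = rng.uniform(900.0, 1010.0)
    wind = 1200.0 - 1.15 * pressure + rng.gauss(0.0, 3.0)
    rows.append((pressure, wind))

training, testing = partition(rows, 0.8, seed=1)
model = learn(training)
predictions = apply_model(model, testing)
mean_abs_error = sum(
    abs(p - actual) for p, (_, actual) in zip(predictions, testing)
) / len(testing)
print(f"intercept={model[0]:.1f} slope={model[1]:.3f} MAE={mean_abs_error:.2f}")
```

The important design point is the same one KNIME enforces through its ports: the model is fit only on the training rows, and its quality is judged only on rows it has never seen.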

Building a Decision Tree Model in KNIME

The next predictive model that we want to talk about is the decision tree. Unlike linear regression, which is relatively simple, decision trees come in a variety of flavors and can be used for both classification and regression-type models. While in general the term “decision tree” can apply to both, KNIME distinguishes between the two types with separate nodes.

  • Decision Tree Learner/Predictor: Used to build and predict classification trees
  • Simple Regression Tree Learner/Predictor: Used to build and predict regression trees
 

Regardless of the type of model you wish to construct, a decision tree follows the same general principles. Through various algorithms, the tree places records from the dataset into binary groups (yes/no, 0/1, true/false) until each record reaches a final designation. The term “decision tree” is derived from the branching nature of the flow diagram commonly associated with this type of model.

Decision Tree Overview from Wikipedia.png
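To make the idea of binary grouping a little more concrete, here is a minimal Python sketch of how a tree might choose its first yes/no question. It uses Gini impurity, one common splitting criterion, and picks the feature whose split leaves the purest groups. The records, feature names, and function names are hypothetical, and this is an illustration of the general technique rather than a description of KNIME's internals.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_split(records, features, target):
    """Return (feature, weighted_impurity) for the best yes/no split."""
    best = None
    for feature in features:
        yes = [r[target] for r in records if r[feature]]
        no = [r[target] for r in records if not r[feature]]
        weighted = (len(yes) * gini(yes) + len(no) * gini(no)) / len(records)
        if best is None or weighted < best[1]:
            best = (feature, weighted)
    return best


# Hypothetical records, invented for this sketch.
records = [
    {"feathers": True, "swims": False, "label": "bird"},
    {"feathers": True, "swims": True, "label": "bird"},
    {"feathers": False, "swims": True, "label": "fish"},
    {"feathers": False, "swims": True, "label": "fish"},
    {"feathers": False, "swims": False, "label": "mammal"},
    {"feathers": False, "swims": False, "label": "mammal"},
]
feature, impurity = best_split(records, ["feathers", "swims"], "label")
print(feature, round(impurity, 3))  # feathers 0.333
```

A full tree would repeat the same search inside each resulting group until the groups are pure or some stopping rule is reached.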

Although decision trees are more computationally complex than linear regression models, building one in KNIME is no more complicated. Again, we stress the importance of understanding when and why you should use a decision tree.

With all that behind us, let’s look at a simple example of a classification tree and how to build one in KNIME.

Animal Classification

How can you classify animals? Two legs versus four, lungs versus gills, and scaly versus furry are just a few ways to put an animal into one group or another. For this problem, we will visit a famous dataset shared by the University of California, Irvine Machine Learning Repository.

Decision Tree Dataset.png

In this dataset, we have 100 animals with a variety of boolean and numeric values associated with each. For example, we can see that aardvarks are hairy, lack feathers, and possess four legs. The type column indicates to which class each animal belongs—mammals are put into class 1, fish into 4, and so on.

Building a decision tree model is almost identical to the linear regression example we saw above. In fact, the only real difference between the two is the configuration settings within the learner and predictor nodes. As you can see below, the workflow is quite simple and the model predicted almost every animal’s type correctly.

Because the Decision Tree Predictor applies a classification model, it offers a configuration setting for displaying the probability of each class. In this case, each prediction was given a probability of 100%, even when it was wrong. We encourage you to explore and play around with the configuration settings within the Decision Tree Learner node to see what gives your model the best results (for this example, we left every setting at its default).

Decision Tree Output.png
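One plausible reading of those 100% probabilities is sketched below. It assumes that a predicted class probability is simply the share of each class among the training records that ended up in the same leaf. That is a common way for decision trees to report probabilities, but it is an assumption on our part here and not something we verified against KNIME's implementation. Under that assumption, a leaf containing only one class always reports 100%, whether or not the prediction turns out to be right for a new record.

```python
from collections import Counter


def leaf_probabilities(training_labels_in_leaf):
    """Share of each class among the training records that reached a leaf."""
    total = len(training_labels_in_leaf)
    counts = Counter(training_labels_in_leaf)
    return {label: count / total for label, count in counts.items()}


# Hypothetical leaves: class 1 = mammal, class 4 = fish, as in the type column.
pure_leaf = leaf_probabilities([1, 1, 1, 1])
mixed_leaf = leaf_probabilities([1, 1, 1, 4])
print(pure_leaf)   # {1: 1.0}
print(mixed_leaf)  # {1: 0.75, 4: 0.25}
```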

KNIME offers two additional nodes that can be valuable for sharing and understanding your decision tree model: Decision Tree to Image and Decision Tree to Ruleset. The first of these generates a dynamic image that visualizes how the records in your model were categorized. The second provides a text table that spells out the rules used to determine the classifications.

Decision Tree Image.png

Building a Time Series ARIMA Model

The final predictive model that we will look at in this blog post is the ARIMA time series model. Unlike the previous two topics, this model requires you to install the KNIME Autoregressive Integrated Moving Average (ARIMA) extension, as seen below. If you need help installing extensions, check out our blog on KNIME extensions.

Time Series Install Extension.png

A discussion on the technical details of time series analysis, whether ARIMA or ETS or any of the other models that have been formulated, would take us too far afield. If you are interested in building a time series model of any sort, we highly encourage you to thoroughly review the topic before you build any erroneous models. For this post, we will assume that an ARIMA model is the correct model for your given situation.

Once you have installed the extension you can find the new nodes in KNIME Labs → ARIMA within the Node Repository. The extension comes with five nodes, including the familiar learner and predictor nodes that we’ve already discussed. Let’s look at a quick example to see what these nodes can do.

Predicting Crimes in Phoenix, Arizona

We have a dataset containing nearly 400,000 crimes committed in Phoenix, Arizona between 2015 and 2021. Using this historical data, we would like to build an ARIMA time series model to forecast future monthly crime numbers.

Time Series Phoenix Crime Stats.png

Within the workflow (link shared at the end of this blog), we have performed various functions to summarize the data at the monthly level. Once your data is prepped and aggregated to the correct level of granularity, building a time series model is no more challenging than either of the other models we’ve discussed so far—with one major caveat.

By default, the parameters of the time series model are set to AR = 0, I = 0, and MA = 1. If you don’t understand what these values mean right now, don’t worry. What matters is that the ARIMA Learner node has no way to automatically find the best parameter values, and it is very likely that these default values will lead to very poor performance.

With that said, we recommend reading KNIME’s blog post on parameter optimization (of course there’s an extension for that!). By following these optimization steps, you can dynamically generate the optimal parameters for your model without having to guess-and-check them dozens or more times.

Time Series Workflow.png
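To give a rough sense of what the three parameters mentioned above control: in standard ARIMA notation, the AR order is how many past values the model looks back on, the I value is how many times the series is differenced before modeling, and the MA order is how many past forecast errors the model uses. The Python sketch below illustrates only the differencing step. The function name and the numbers are made up for illustration; they are not the Phoenix data and not part of the KNIME extension.

```python
def difference(series, times=1):
    """Apply first-order differencing `times` times (the "I" in ARIMA)."""
    result = list(series)
    for _ in range(times):
        result = [later - earlier for earlier, later in zip(result, result[1:])]
    return result


monthly_counts = [10, 12, 15, 19, 24]  # made-up numbers, not the Phoenix data
print(difference(monthly_counts, times=0))  # [10, 12, 15, 19, 24]
print(difference(monthly_counts, times=1))  # [2, 3, 4, 5]
print(difference(monthly_counts, times=2))  # [1, 1, 1]
```

With I = 0, as in the default settings, the model works on the raw series. Each round of differencing shortens the series by one value and removes one level of trend, which is why the choice matters for data that rises or falls over time.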

Closing

We hope that through this blog post you gained a better understanding of how to build predictive models within KNIME Analytics Platform. While we only discussed three of the many different models that are available, the general principles seen here should hold true whether you’re building a linear regression or a neural net.

If you’re interested in exploring the workflows and datasets used in this blog post, visit our KNIME Hub page to download them.

phData’s team of experts is skilled at building predictive models in KNIME.
