January 1, 2022

How to Determine the Best Regression Model: 4 Tools in Alteryx

By John Emery

As a data and analytics consultant, I interact with clients from a wide array of backgrounds. Some are highly technical while others wouldn’t know a normal distribution if it bit them on the nose. This is often a good thing in practice, as diverse teams can return more creative ideas than a team that consists of people with the exact same experiences.

A problem can arise, however, if a person with little or no statistical knowledge is required to build a predictive model (this happens more often than you might think). If you knew little about statistics and modeling, where would you turn? Thankfully, options today are plentiful, and with tools like Alteryx, anybody can build a predictive model in mere minutes with no coding experience required.

What are Regression Models?

There are many types of predictive models, from linear regression to neural networks to time series forecasts. For the purposes of this post, we will focus on a particular subset: regression models. These models are what most people are familiar with, and they produce (relatively) easy-to-understand outputs. It is also much easier to explain what is going on in a linear or logistic regression than in, say, a neural network (which is basically magic).

Alteryx offers the following regression tools, which we will look at in more depth below: linear regression, logistic regression, count regression, and gamma regression. In this post, we will focus on the scenarios in which each of the regression models would be appropriate to use (otherwise this post would turn into a novel).

Downloading the Predictive Tools Package

Before we get started, you may need to download the Alteryx Predictive Tools Package. If your Predictive tool palette doesn’t look similar to the image below, you will need to download the package.

Alteryx has a helpful guide to download the package here. Once you have successfully installed the predictive tools package, you may continue!

Regression Models

Two of the most common and widely used types of predictive analyses are regression and classification models. A classification model attempts to place a given data point into one of two or more classes (won/lost, survived/died, mammal/reptile/fish, etc.). Regression models, on the other hand, attempt to estimate the value of a dependent variable based on its relationship to one or more independent (predictor) variables. You could build a regression model to estimate a person’s weight based on their height, age, and gender, for example.

Regression models are frequently used for predictions and forecasting, which go hand-in-hand with machine learning. Regression analysis can also be used to identify relationships between variables and, with careful study design, to help infer causality.

In the sections below, we will walk through the regression models that are available in Alteryx. In general, the four regression tools that we will discuss are used in specific situations, and they are not interchangeable. For instance, in a case where linear regression is appropriate, a logistic regression is most likely inappropriate. Make sure to take close note of which regression tools are appropriate for which situations.

Linear Regression

The first tool we’ll discuss is the most widely known and used regression tool: linear regression. If you ever took an elementary statistics course in high school or college, you almost certainly learned about linear regression.

Linear regression is used to estimate the relationship between a dependent variable and one or more independent (explanatory) variables. Models that have only one explanatory variable are called simple linear regression while models with more than one are known as multiple linear regression. As its name suggests, a linear regression model is most appropriate when the variables exhibit a linear relationship.

In the image below, we have plotted the maximum wind speed versus minimum pressure of Atlantic Ocean hurricanes. We can clearly see that, as the wind speed increases, minimum pressure decreases in a linear fashion. There are no outliers and the data follows an obvious linear trend. This data set would be an ideal candidate for a linear regression model.

A linear regression here could reveal to us that we expect a storm with 100-knot winds to have a minimum pressure of about 957 millibars.
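To make the idea concrete, here is a minimal sketch of a simple (one-predictor) linear regression fitted by ordinary least squares. The wind speed and pressure values are made up for illustration and are not the hurricane data from the chart; the function name fit_line is likewise just a placeholder. Alteryx does all of this for you inside the Linear Regression tool, so treat this only as a peek at the arithmetic.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up (wind speed in knots, minimum pressure in millibars) pairs.
# These are illustrative values only, not real hurricane observations.
wind = [40, 60, 80, 100, 120, 140]
pressure = [1002, 990, 975, 958, 944, 925]

intercept, slope = fit_line(wind, pressure)

# Use the fitted line to estimate pressure for a 100-knot storm.
predicted_100 = intercept + slope * 100

print(f"pressure = {intercept:.1f} + ({slope:.3f} x wind)")
print(f"estimated pressure at 100 knots: {predicted_100:.1f} mb")
```

The negative slope is the "as wind speed increases, pressure decreases" relationship expressed as a number.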

Not all relationships are a good fit for linear regression, however. Take the following data set, for instance, which shows the number of bald eagle breeding pairs in the lower 48 states by year. The line drawn represents the line of best fit, which results in very large residuals (i.e., errors) in the earliest and most recent years. A data set such as this would be much better modeled using a non-linear regression model (exponential, to be exact).
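One common way to handle exponential growth is to fit a straight line to the logarithm of the target and then transform back. The sketch below compares that approach with a plain straight-line fit on synthetic data that grows 10% per period; the numbers are generated for illustration and are not the bald eagle counts. The helper names fit_line and sse are placeholders.

```python
import math
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def sse(actual, fitted):
    """Sum of squared residuals (errors)."""
    return sum((a - f) ** 2 for a, f in zip(actual, fitted))


# Synthetic counts that grow 10% per period (illustrative, not real data).
periods = list(range(30))
counts = [round(500 * 1.1 ** t) for t in periods]

# Straight-line fit on the raw counts.
a, b = fit_line(periods, counts)
linear_fit = [a + b * t for t in periods]

# Exponential fit: a straight line on log(count), transformed back with exp().
log_a, log_b = fit_line(periods, [math.log(c) for c in counts])
exp_fit = [math.exp(log_a + log_b * t) for t in periods]

linear_sse = sse(counts, linear_fit)
exp_sse = sse(counts, exp_fit)

print(f"straight-line SSE: {linear_sse:,.0f}")
print(f"exponential SSE:   {exp_sse:,.0f}")
print(f"estimated growth per period: {math.exp(log_b) - 1:.1%}")
```

The straight line misses badly at both ends of the series, which is the same pattern of large residuals described above.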

Logistic Regression

Another commonly used and easy-to-interpret regression model is the logistic regression. You would use a logistic regression to model the probability of a data point being in one of two possible states.

You could use a logistic regression to model the likelihood of passing a test (pass/fail) based on predictor variables such as time spent studying, grades from last year, and the proportion of other students who passed the course.

Take the following table, which lists the number of hours that each of 20 students studied for a course, along with their results (0 = fail, 1 = pass):

Looking through the numbers, we can clearly see that, as students spent more time studying, they passed the course more frequently. For instance, no student who studied for at least 10 hours failed the course.

A logistic regression returns the probability of a variable being in one of two states. Thus, if we plot these variables on a chart with the logistic regression curve on top of it, we can estimate the probability of passing or failing the course for any number of study hours.

In the image above, the dashed blue line represents the logistic regression curve. Notice how it is bounded by 0 and 1 on the bottom and top, respectively. Had we run a linear regression instead, the blue line would go toward negative and positive infinity, even though values outside of the range 0 to 1 are nonsense.

Here, we can see the probability of a student passing given they studied for 6 hours is about 27.5%. This probability rises to about 97.5% for students who study for 12 hours.
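For the curious, a logistic curve with one predictor has the form p = 1 / (1 + e^-(a + b × hours)), so any two points on it pin down a and b. The sketch below back-solves the coefficients from the two approximate probabilities quoted above (27.5% at 6 hours, 97.5% at 12 hours). These are not the fitted coefficients from the Alteryx tool, which the post does not list; they are simply the values implied by those two readings, under the assumption that hours studied is the only predictor.

```python
import math


def logit(p):
    """Log-odds of a probability; the inverse of the logistic function."""
    return math.log(p / (1 - p))


def logistic(z):
    """Squashes any real number into the range 0 to 1."""
    return 1 / (1 + math.exp(-z))


# The two (hours studied, probability of passing) readings quoted in the text.
hours_1, prob_1 = 6, 0.275
hours_2, prob_2 = 12, 0.975

# On the log-odds scale the model is a straight line, so two points fix it.
slope = (logit(prob_2) - logit(prob_1)) / (hours_2 - hours_1)
intercept = logit(prob_1) - slope * hours_1


def pass_probability(hours):
    """Estimated probability of passing for a given number of study hours."""
    return logistic(intercept + slope * hours)


for hours in (0, 3, 6, 9, 12, 20):
    print(f"{hours:>2} hours: {pass_probability(hours):.3f}")
```

However far you push the number of hours in either direction, the output never leaves the 0 to 1 range, which is exactly what a straight line cannot promise.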

Count Regression

Unlike linear and logistic regressions, which are quite well known, count (and the upcoming gamma) regression models are seldom heard of. Still, they occupy an important place in predictive modeling.

Count regression models are used when you want to estimate a small non-negative integer, such as the number of calls into a call center on a given morning or the number of visitors to a particular check-out line in a grocery store. Basically, if your dependent variable can take on values other than {0, 1, 2, 3, …}, a count regression is not appropriate.

Imagine you have a data set that documents driver ages, car models, and age of the car as categorical variables along with the number of insurance claims for each combination as seen below.

We know that the number of claims cannot be a negative number, nor can it be a non-whole number. This is a good candidate for a count regression. Much like the linear regression that we saw previously, the count regression returns the estimated number of claims for each OwnerAge-Model-CarAge combination. The danger with using a linear regression here is that it could return estimates of negative claims, which is nonsensical with this data set.

A count regression avoids this problem because the tool is built on one of three models: Poisson, Quasi-Poisson, or Negative Binomial. All three treat the target as a non-negative whole number, so the estimated number of claims can never be negative. The linear regression, on the other hand, assumes normally distributed errors, and a normal distribution has no issues with fractional and negative values.
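Here is a small sketch of why a Poisson-style model cannot produce a negative estimate. Poisson regression conventionally uses a log link, meaning the expected count is the exponential of the usual "intercept + slope × x" expression, and an exponential is always positive. The coefficients b0 and b1 and the predictor x below are hypothetical values chosen only to show the effect; they are not taken from the insurance data, and the same pair is reused with and without the exponential purely to contrast the two forms.

```python
import math


def poisson_pmf(k, mean_count):
    """Probability of observing exactly k events when the expected count is mean_count."""
    return math.exp(-mean_count) * mean_count ** k / math.factorial(k)


# Hypothetical coefficients for a single numeric predictor x (illustration only).
b0, b1 = 0.5, -0.4


def linear_estimate(x):
    """Straight-line form: nothing stops this from dropping below zero."""
    return b0 + b1 * x


def count_estimate(x):
    """Log-link form used by Poisson regression: always strictly positive."""
    return math.exp(b0 + b1 * x)


for x in (0, 2, 5, 10):
    print(f"x={x:>2}  linear: {linear_estimate(x):6.2f}  count: {count_estimate(x):6.3f}")

# The Poisson distribution only assigns probability to 0, 1, 2, 3, ...
expected_claims = count_estimate(2)
probabilities = [poisson_pmf(k, expected_claims) for k in range(20)]
for k, p in enumerate(probabilities[:4]):
    print(f"P(claims = {k}) = {p:.3f}")
```

Note that the estimate itself can still be fractional (an expected count of 0.7 claims, say); it is the observed values that must be whole numbers.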

Gamma Regression

Gamma regression is another highly specialized statistical method. You would consider using a Gamma regression if your target variable can only take on strictly positive values. That is, it can never be zero or negative. Unlike the count regression above, though, it can take on fractional values, such as 1.5 or 3.1415.

A Gamma regression is generally used when your data is right-skewed; that is, when there are relatively many small values and relatively few large values. Graphically, the data may look like the chart below, which shows the amount of rainfall per day in half-inch increments. The large majority of observations fall in the [0 – 0.5] and [0.5 – 1.0] buckets, while very few days recorded more than 2 inches of rain.

With a data set such as this, we can attempt to predict the amount of rain in a day from other weather observations such as humidity, wind speed, cloud cover, and temperature.
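To get a feel for what right-skewed, strictly positive data looks like, the sketch below draws simulated values from a gamma distribution and groups them into half-inch buckets like the ones described above. The shape and scale values (1.5 and 0.4) are arbitrary choices for illustration, not estimates from real rainfall records.

```python
import random
from statistics import mean, median

random.seed(0)  # fixed seed so the simulated values are repeatable

# Simulated daily amounts; shape=1.5 and scale=0.4 are illustrative choices.
amounts = [random.gammavariate(1.5, 0.4) for _ in range(10_000)]

# Group the values into half-inch buckets: [0.0, 0.5), [0.5, 1.0), ...
bucket_width = 0.5
buckets = {}
for amount in amounts:
    lower_edge = int(amount // bucket_width) * bucket_width
    buckets[lower_edge] = buckets.get(lower_edge, 0) + 1

for lower_edge in sorted(buckets):
    upper_edge = lower_edge + bucket_width
    print(f"[{lower_edge:.1f}, {upper_edge:.1f}): {buckets[lower_edge]}")

# In right-skewed data, a few large values pull the mean above the median.
print(f"mean:   {mean(amounts):.3f}")
print(f"median: {median(amounts):.3f}")
print(f"min:    {min(amounts):.6f}")
```

Every simulated value is greater than zero, and most of them land in the first bucket or two, which are the two hallmarks of a good Gamma regression candidate.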

Final Words

When you need to build a predictive model, one of the most important decisions you must make is what type of model is most appropriate. In this blog, we went over Alteryx’s four regression tools (Linear, Logistic, Count, and Gamma) and described when each would be an appropriate model to use.
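As a recap, the rules of thumb from this post can be sketched as a small helper that looks only at the values your target variable takes. The function name suggest_regression_tool is hypothetical, and the skew check (mean greater than median) is deliberately crude. Use it as a memory aid, not as a substitute for looking at your data.

```python
from statistics import mean, median


def suggest_regression_tool(target_values):
    """Rule-of-thumb suggestion based only on the values the target takes."""
    values = list(target_values)
    if not values:
        raise ValueError("need at least one target value")

    # Two possible states coded as 0/1 (fail/pass, lost/won, ...).
    if set(values) <= {0, 1}:
        return "Logistic Regression"

    # Non-negative whole numbers: 0, 1, 2, 3, ...
    if all(v >= 0 and float(v).is_integer() for v in values):
        return "Count Regression"

    # Strictly positive and right-skewed (crude check: mean above median).
    if all(v > 0 for v in values) and mean(values) > median(values):
        return "Gamma Regression"

    # Everything else: start with a linear regression and check the fit.
    return "Linear Regression"


print(suggest_regression_tool([0, 1, 1, 0, 1]))
print(suggest_regression_tool([0, 3, 2, 7, 1]))
print(suggest_regression_tool([0.2, 0.3, 0.4, 5.0]))
print(suggest_regression_tool([-1.5, 2.0, 0.3]))
```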

Although we didn’t cover them in this post, there are other predictive models available in Alteryx: Boosted Models, Decision Trees & Random Forests, Neural Networks, and Support Vector Machines. These models are more complex but can potentially achieve more accurate results than a standard regression model. It can be very tricky to determine which of these may be better than, say, a linear regression model. In that situation, you can leverage one of Alteryx’s most useful tools: Model Comparison.

Do you have more questions about Alteryx? Talk to our expert consultants today and have all your questions answered!
