January 1, 2022

Tableau Prep vs. Alteryx Designer: Basic Data Preparation – Part 2: Data Cleansing

By Aidan Bramel

If you missed Part 1, start here!

Though Tableau Prep and Alteryx have some differences in breadth and depth of product functionality, many look to the two products for their data prep needs. For that reason, for the next few weeks, let’s compare and contrast the products from the perspective of data prep alone.

In my eyes, the key capabilities of data prep are as follows:

  • Input: connecting to your data
  • Data Investigation: knowing your data, and what is wrong with it
  • Data Cleansing: string manipulation, calculations, filtering, and handling nulls
  • Data Joining & Transformation: manipulating the structure with joins, unions, or cross-tabs/transposing
  • Output: where to data ends up

In Part 1, we discussed Input and Data Investigation.

This week, I’ll discuss Data Cleansing. This is a pretty big bulk of the data prep work.

Then, we’ll get through Joining and Output and round it out with conclusions.

Establishing the Issues

In Part 1, I mentioned the data set we’re looking at is all about winners of CBS’s show Survivor. I also mentioned it is kind of a mess. After looking at the data, there’s some changes we want to make:

  • Correct data types.
  • Split up the final votes into separate fields for each final tribal member.
  • Using the above fields, determine votes received by the winner and total votes lost.
  • Replace the null values from the weird “Unanimous” field.
  • Remove redundant “Survivor: ” from Season Title.
  • Create a unique “WinnerID”

Let’s look at which tools each product has to fix those issues. This also might be a good time to point out that for many of these problems, there are several ways to go about solving them. I will do my best to point out any other major alternatives.

Again, I’ll give a disclaimer that this data has some spoilers for CBS’s Survivor so you’ll want to catch up on Hulu before moving forward.

Data Cleansing: Correct Data Types

Let’s look at which tools each tool has to fix those issues.

Tableau Prep

As you saw in part 1, we could have changed the data type right away when we inputted the data. However, it is also possible to change the data type at virtually any point in the flow.

In our Clean step from the previous data investigation, we can change the data type of each field by selecting the data type icon in the corner and reassigning the correct role. We can also assign geographic roles as they will be used in Tableau Desktop.

Alteryx Designer

For changing data types, an easy go-to tool is Auto Field. The Auto Field tool assigns the most size-efficient possible data type to accommodate the data in each column.

As you can see, we now have most appropriately sized fields. However, unlike Tableau Prep, we cannot assign geographic roles for use in Tableau.

Other Alteryx tools that can correct data types: Select Tool

The most obvious thing the Select tool can do is Select. However, you can also change data types, rename fields, and add field descriptions.

The “Select” tool’s interface is present in many other tools’ interfaces.

Data Cleansing: Split Fields

For this, I want to be able to split our Final Vote field, which contains between 2 and 3 numbers that represent the number of votes received by a player at the final tribal council. The numbers are split by a “–” and ordered by place first–second–third. Not all seasons have a third person in the final three, so that value may not always exist. 

For example, Cook Islands ended with a final three vote of 5–4–0 while Borneo had a final two with 4–3.

Tableau Prep

Tableau Prep and Tableau Desktop are pretty good at this particular activity. We will use a Custom Split with the separator “–”. We want to split with the first 3 fields. In the case of a final 2, this third field will be blank.

We will also want to change the data types on these, as they are currently Strings. They need to be numbers for what we will do next (add them). Let’s swap those, like we did last step, and rename these fields to something more meaningful. The final changes are, as always, documented in the Changes tab:

Alteryx Designer

For splitting values in Alteryx, Text to Columns tool is usually the easiest way to go. This tool allows you split values into either columns or rows based on a delimiter. 

In this case, I’m configuring my tool to split our Final Vote field, using “–” as the delimiter. We are splitting this to 3 columns. In the case of a final two, this field will just be null.

This results in three new fields: Final vote1, Final vote 2, Final vote 3. 

These will come out as String data types, so I will use the Select tool this time to change the data type to something numeric. Using this tool, we can also remove the original Final Vote field. I could also use the same Select tool to rename the auto-generated field names, but interestingly enough, these are actually somewhat meaningful so I won’t change them this time.     

Other Alteryx tools that could split our field: RegEx Tool

If you know a little RegEx, this tool can do pretty much anything. In this case, I would configure RegEx to Parse setting and use the following as my regular expression: (\d+)–(\d)–?(\d?). It’s also nice because you can edit the output columns in the same tool, mitigating the need to rename or re-type fields with Select.

Data Cleansing: Creating the "Votes Lost" field

Here, I want to create a field that will add together the 2nd and 3rd place votes, if the 3rd place votes exist.

Tableau Prep

If you use Tableau Desktop, you can probably guess where this is going: Calculated Field.

We are going to create a calculated field called VotesLost. This field will add together the votes for second place and third place. Again, Tableau prep makes this pretty easy for us. The final calculation is below.

Alteryx Designer

This one screams Formula tool, so I’m going to connect that to my workflow.

I want to create a new field, so I’ve configured my tool to add a new column and named it “VotesLost”.

Alteryx won’t add null values, I’m going to use an IF statement that will check whether [Final vote3] is null. If it is null, we’ll return [Final vote2]. Else, we’ll add [Final vote2] and [Final vote3]. The final formula is shown below, with an INT data type.

Data Cleansing: Replace Null Values

I’ve got two fields with a lot of Null values: Final Vote3 (for seasons with a final two) and this weird “Unanimous?” Field that either says “Yes” or is null. YIKES. 

There are two different objectives here:

  • Final Vote3: replace nulls with 0
  • Unanimous: change nulls to “No”

Unanimous is a field that says Y/N was the final vote unanimous? Realistically we could just remove this weird field, but humor me. I’m trying to show some examples here.

Tableau Prep

I’ll handle both of these pretty similarly in Tableau Prep: double-clicking the field and replacing it with what I want.

I could also do this with calculated fields using string manipulation and removing the old field, but why work harder than I need to?

Alteryx Designer

Final Vote3 is easy. The Data Cleansing tool was basically built for this.

To configure Data Cleansing, I select the fields I want to cleanse and decide what I want to happen to it. Here, I can select to replace my nulls with zeroes and boom – done. There’s also a host of awesome options for cleaning up your fields with a simple check. 

To configure Data Cleansing, I select the fields I want to cleanse and decide what I want to happen to it. Here, I can select to replace my nulls with zeroes and boom – done. There’s also a host of awesome options for cleaning up your fields with a simple check. 

For Unanimous, let’s break Formula back out.

Here I’m going to use a similar formula to the one I used to create our VotesLost field. I’m checking if [Unanimous?] is null, and returning “N” if it is. Otherwise, we’re returning “Y”.

Other Alteryx tools that are also meant handle Nulls:

Imputation Tool

This tool replaces problematic numeric values (such as nulls) for specified columns with another value such as the median or a user-specified value.

Filter Tool

This tool is used to filter values out of the data stream. This is often used to filter out null values.

Data Cleansing: Remove "Survivor: " from Season Titles

Time for some string manipulation! The “Survivor: ” portion of the Season Title is a bit redundant, wouldn’t you say? Let’s remove it so only the season identifier is remaining.

Tableau Prep

Remember when Tableau Prep suggested we do this? I’m just going to navigate to the recommendations and let them do all of the work.

Tableau automatically splits this into two fields for us, which is great. Now we can deselect the field that only contains “Survivor: “, rename our fields, and move on.

Alteryx Designer

Again, I look to the formula tool. I’ll actually just use the same formula from our last step to keep things condensed.

Using the Season title column, I’ll start my formula. I’m going to use the Replace function. I’m going to replace “Survivor: ” with “”. AKA, I’m replacing it with nothing.

Other Alteryx tools that could do this: Text to Columns Tool

Text to Columns Tool

If I wasn’t feeling like using the formula tool, I could use Text to Columns with a delimiter of “\s”, which signifies a spaces, and configure to 2 columns, leave extra in last column. I could then deselect the “Survivor: ” field and we’ve got the same result. Have I mentioned that I love this tool?

Data Cleansing: Add a WinnerID

As of today, there is one two-time winner of Survivor so SeasonID may be slightly different than a unique identifier for the contestant who won. So for WinnerID, we want the ID to be the season number, or for the two-time winners the first season, that they won.

Tableau Prep

This is a small data set, so this is something we can do with Prep with a calculated field.

Because 2020.1 has released LODs in Tableau Prep, I am able to come up with a WinnerID by taking the minimum value for season number ([No.]) for each [Winner]. If we did not have a SeasonID (or LODs), I’m not sure of an easy way to do this. 

Alteryx Designer

I’m going to use two tools for this: Multi-Row Formula and Sort. 

Sort is very simple to use. Here, I am sorting by Winner. Frankly, I don’t care whether it’s ascending or descending in this case, I just need to make sure all of the winners are in a row next to each other. I am secondarily sorting by Season Number. I do care that the season numbers are sorted Ascending, so the first season number will be the WinnerID.

Now for the Multi-Row Formula tool. This is a slightly more advanced prep tool. (It’s also one of my favorites)

I’ll start by configuring this to create a new field, WinnerID. To populate WinnerID, I will use the expression at the bottom of the configuration. The expression will utilize values from the prior, current, or next row in the “Variables” tab. Below, I am saying that if the winner from the previous row is the same as the winner from the current row, then give me the same WinnerID as the row with the previous row, ie the row with the same winner. If it’s not the same as the previous row, return the season number [No.].

Note that using a similar combination of tools and expressions, we could have done this without the Season number field.

Other related Alteryx tools to be aware of: RecordID Tool

Though it won’t work in this particular situation, I wanted to call out that this tool exists to assign a number to each row. This would have also been a great way to generate SeasonID if we did not have it in the data set already.

Data Cleansing: All the Other Stuff

Our walk through with the Survivor data showed us some practical ways to solve common problems, but I didn’t include all of the functionality related to data cleansing. Let’s go through some extras to call out that we missed in the walk through.

Tableau Prep

Tableau Prep has a few additional features related to data cleansing that we haven’t already touched on. 

For strings, we can do several things:

  • Filtering: by calculations, wildcard, handling null values, and selected values
  • Grouping: manually or by pronunciation, common characters, or spelling
  • Altering Case: changing to uppercase or lowercase
  • Removing unnecessary characters: remove letters, numbers, punctuation, spacing, or extra whitespace
  • We already saw this, but splitting values.

For numeric values, we can do the same filtering and group by manual selection.

All values can also be sorted either by count of domain.

Obviously, there is also a lot of Calculated Field functionality; this is very similar to that of Tableau Desktop. 

Alteryx Designer

I’ve talked about a lot of Alteryx tools so far, but there’s a whole toolkit of preparation. I’m going to cover off on some of the key remaining tools from that toolkit that I haven’t mentioned yet.

Create Samples: Splits input records into two or three random samples.

Multi-Field Formula: Execute a single function on multiple columns at once.

Generate Rows: Generates new rows of data based on a conditional expression.

Multi-Field Binning: Assigns data into bins based on values of one or more columns.

Random % Sample: Outputs a specified number or percent of records obtained via a random sample of the input data.

Oversample Field: Samples incoming data and ensures equal representation of data values for effective use in a predictive model.

Select Records: Choose a subset of records from the input stream to be output.

Unique: Separates data into two streams – duplicate and unique rows, based on the columns configured.

Tile: Group data into sets (tiles) based on value ranges in a column.

There is also functionality to both Filter and use Formula In-Database.

Like with Tableau Prep’s calculated fields, Alteryx’s formula tool has a similar wide array of options.

Conclusions Thus Far

We’re not done yet: next time we will be discussing Transformation (Crosstabs, Joins, Unions) and Output. But what are some highs and lows so far?

Tableau Prep

Pros:

  • Very user friendly and intuitive
  • The addition of LODs has greatly increased ability to solve problems

Cons:

  • Limited options for multi-row calculations
  • Limited options for more advanced data prep
  • Less options for decreasing data type sizes

Alteryx Designer

Pros:

  • Greater flexibility in options for solving more complex problems
  • Advanced options for multi-row and multi-column formulas

Cons:

  • More configuration steps to make changes
  • New users may be overwhelmed by large number of tools

So far a key take away is that Tableau Prep has generally had the upper hand on ease of use, while Alteryx is clearly reigning over depth of functionality. We’ll see if that pattern continues in the next posts.

What are your thoughts so far? Keep your eyes out for Part 3 coming soon.

Do you have more questions about Tableau or Alteryx? Talk to our expert consultants today and have all your questions answered!

Data Coach is our premium analytics training program with one-on-one coaching from renowned experts.

Accelerate and automate your data projects with the phData Toolkit