November 16, 2021

A Beginner’s Guide to Alteryx: How to Use Regular Expressions

By Jake Miknuk

In this article, we will be going over some basics of using the parse functionality of the RegEx tool in Alteryx. 

What is Parsing?

Parsing data is the process of converting data that is retrieved from somewhere into a format that is acceptable for analysis and more readable. 

You might wonder what regular expressions are and why anyone would need to use regular expressions for parsing when there are many string functions available in Alteryx? Also, if you are a fan of string functions, I have good news for you, string functions will typically run faster than a regular expression. 

I’m really selling regular expressions at this point, aren’t I? Well, the beauty that gives regular expressions their superpowers lies in the fact that you can generate fairly complex operations with less code and use less of your system’s processing power to accomplish the task. 

This is useful for complicated situations when a string manipulation or the trusty text-to-columns tool just won’t cut it. Regular expressions can seem a little daunting but hopefully, we can alleviate some of your fears about using them so that you can add RegEx to your data preparation tool belt. 

What are Regular Expressions?

Regular expressions, most commonly referred to as RegEx (pronounced: Rej-Ex), are a sequence of characters that allows the user to create patterns that help match, locate, and manage any string data. 

In Alteryx, you can use the RegEx tool to replace, tokenize, parse, and match string values. Using regular expressions within the formula tool opens up even more flexibility, which we can discuss in a subsequent article. 

How to Use the RegEx Tool in Alteryx

For now, let’s take a look at the RegEx tool’s configuration in Alteryx. 

Looks basic enough, right? We have a brief description of what the tool allows you to do: Build an expression to parse, match, or replace data. A drop-down to choose the column you want to apply your regular expression to (here, we use a column titled test). The area where you can write your regular expression (case insensitive is selected by default). A drop-down to choose your method of output. And finally, the output columns that are generated from your regular expression. You can also assign a new alias to the output columns. 

Not only do you get the functionalities above, but the RegEx tool also gives you some help when it comes to writing your expression. Just click on the Plus (+) button included in the Format to Convert section of the configuration. There, you will see that you get some commonly used expressions and a quick description of the expression (shown below). 

Example of Using RegEx in Alteryx

Let’s take a look at an example that Alteryx provides to understand what a simple regular expression looks like and its results. We start with a data set that includes CustomerID, Name, and Address.

Most of the zip codes in the Address field have a format of five numbers, a hyphen, and then four numbers. How do we create a column that only includes the final four numbers of the zip code? Let’s create and analyze a regular expression that parses out those four numbers. 

The regular expression used here is -(\d{4}). Let’s break this syntax apart to understand what’s going on here.

  • The expression starts with a hyphen, -. Notice that the hyphen is outside of the parentheses of the main part of the expression. This is important because this expression is looking for any instances of a dash followed by what is inside of those parentheses, referred to as a marked group in RegEx.

  • Now we look at what is inside of the parentheses; the marked group. This is what will eventually get parsed out.

    • \d denotes a digit, any number from 0-9.

    • Inside of the braces, or curly brackets, we have the number 4; {4}. This denotes that we are searching for a quantity of four digits since it is in a grouping with the \d syntax.

Finally, we can examine our output to this simple regular expression parse below. 

Notice how there is a new column that we have named zip4 that contains the last four digits of the zip code when that pattern exists and a [Null] value when it doesn’t.

This was a fairly simple example of RegEx parsing. Let’s dive into a more complicated example. Here is a list of books and their details obtained from a sample XML code on the Microsoft website

Sidenote: This XML is in an easily readable format for Alteryx already. If you were to input the file using an input tool, Alteryx would automatically read in the columns and records using the child values.

As you can see here, this is not the most easily readable format. Let’s parse this out so that we have columns for Book ID and Description. I would recommend testing your syntax using a web tool like https://regex101.com/. This tool will allow you to write your regular expression and highlight matches and groupings. I particularly like this tool because it gives you an explanation of what’s going on. For example, here is an expression that parses out the Book ID field.

In Alteryx, it would look like this. 

We use similar concepts as the first example to parse out the Book ID here. Since the XML code has a similar format for all, we can type in the beginning string that we want to look for. In this case, it is <book id =". From there, we create the marked group, (\w{2}\d{3}). This group is the Book ID, and the regular expression is finding anything that has two letters and three digits, matching it, and then parsing it out. 

In this example, we can leave out the beginning of the string and just use (\w{2}\d{3}) to get the same result since there are no other examples in the format of two letters followed by three digits. However, as a best practice, it will be best to leave in the literal string, <book id =", just in case there is data that matches the marked group format and because we know we are only looking for the Book ID records. As you can see in the results window above, we now have three book IDs in separate columns, making them much easier to read and use for analysis. 

Next, let’s create a new expression to parse out the Description fields. This one is a little tricky because we are attempting to parse out the entire sentence. Depending on what the string looks like, writing the regular expression for this is contingent on what the definition of a sentence is in the context of the entire string you are attempting to parse. For this example, a sentence begins with a capital letter and ends with a period. Under that assumption, the expression below is what we come up with. 

				
					<description>([A-Z][^.]*\.)
				
			
  • The beginning of the string we are searching for is <description>, so we add the literal string outside the marked group to ensure that the regular expression begins wherever it finds that string.

  • As mentioned before, we add parentheses () around the syntax that will be the marked group.

  • Inside of the marked group in this expression, [A-Z] will recognize the leading capital letter.

  • [^.]* will find any character that is not a period, and the asterisk denotes that it is looking for this occurrence zero or more times.

  • We end the expression for the marked group with a backslash and period, \.

    • This denotes that it is looking for the expression to end with a period.

Before we move on, let’s go over escape characters quickly. The backslash, \ , is an escape character in RegEx. An escape character is used before a character that has a special meaning in RegEx syntax. For example, a period in RegEx syntax denotes any character. When we use a backslash in front of the period in our example, we are searching for the literal period character in the Description field. 

All of the following characters have a special meaning in RegEx syntax and must be escaped by a backslash to reference the literal character. 

				
					. ^ $ * + - ? () [] {} \ | _ /
				
			

Okay, now that we have gone over escape characters, let’s get back to it. Below is the result of the regular expression that we have created. 

We are almost there! If we were to test the above syntax in a tool like https://regex101.com/, we would have a matching result that parsed out the correct string we are looking for, the description sentence. However, Alteryx reads in the embedded newlines. Luckily the data cleansing tool has a nice option to remove the additional tabs and whitespace, causing these embedded newlines to appear. By selecting the Description field and checking the box for Tabs, Line Breaks, and Duplicate Whitespace, we can eliminate the embedded newlines and make the parsed-out sentence look a lot cleaner.

Final Words

And that wraps it up! Using RegEx to parse strings in Alteryx is an incredibly powerful tool that you can use in addition to the various string functions in a formula tool. 

In the next article, we will go over the matching functionality in the RegEx tool by creating a Social Security Number validator. Stay tuned, and thank you for reading!

Curious if Alteryx is right for your organization?

phData can help, especially around how to get the most value out of the Alteryx platform and its features. Reach out today for answers, best practices, and actionable Alteryx advice!

Data Coach is our premium analytics training program with one-on-one coaching from renowned experts.

Accelerate and automate your data projects with the phData Toolkit