January 1, 2022

Tutorial: Circular Sankey Diagram in Tableau

By Luke Stanke

So you are probably tired of regular old sankey diagrams. What If I told you there is a way to take your Sankey’s to the next level by combining it with something everyone loves: circles.

Before I start the tutorial, I want to provide a bit of code to prep the data (and a link to the data itself if you don’t care about the prep). I used superstore data and summarized it to show the total number of products ordered by customer. 

				
					library(tidyverse)
library(magrittr)

product <- rio::import("/2018.3/en_US-US/Sample - Superstore.xls") %>%
  as_tibble()


product %>%
  group_by(`Customer Name`, `Product Name`) %>%
  summarise(Quantity = sum(Quantity)) %>%
  left_join(
    product %>%
      group_by(`Customer Name`) %>%
      summarise(cust_quantity = sum(Quantity)) %>%
      ungroup() %>%
      arrange(desc(cust_quantity)) %>%
      mutate(cust_rank = row_number(-cust_quantity))
  ) %>%
  left_join(
    product %>%
      group_by(`Product Name`, Category) %>%
      summarise(prod_quantity = sum(Quantity)) %>%
      ungroup() %>%
      arrange(desc(prod_quantity)) %>%
      mutate(prod_rank = row_number(-prod_quantity)) %>%
      group_by(Category) %>%
      mutate(
        prod_cat_rank = row_number(-prod_quantity),
        tmp1 = max(ifelse(Category == "Office Supplies", prod_cat_rank, 0)),
        tmp2 = max(ifelse(Category == "Furniture", prod_cat_rank, 0))
      ) %>%
      ungroup() %>%
      mutate(
        tmp1 = max(tmp1),
        tmp2 = max(tmp2),
        prod_cat_rank = ifelse(Category == "Office Supplies", 0, tmp1) + prod_cat_rank,
        prod_cat_rank = ifelse(Category == "Technology", tmp2, 0) + prod_cat_rank,
        tmp1 = NULL,
        tmp2 = NULL,
        one = 1
      )
  ) %>%
  write_csv("https://i0.wp.com/www.phdata.io/cust_prod_mapping.csv")

data_frame(
  one = 1,
  t = c(seq(0,1, by = .01), seq(2,7))
) %>%
  write_csv("https://i0.wp.com/www.phdata.io/generic_data_model.csv")
				
			

Here is what the data will look like:

				
					# A tibble: 9,941 x 10
# Groups:   Customer Name [?]
   `Customer Name` `Product Name`            Quantity cust_quantity cust_rank Category   prod_quantity prod_rank prod_cat_rank   one
                                                                                  
 1 Aaron Bergman   Akro Stacking Bins               2            13       752 Office Su…            13      1295           777     1
 2 Aaron Bergman   "Carina 42\"Hx23 3/4\"W …        3            13       752 Office Su…            22       691           441     1
 3 Aaron Bergman   Global Push Button Manag…        1            13       752 Furniture             44        62          1070     1
 4 Aaron Bergman   Newell 330                       3            13       752 Office Su…            11      1472           866     1
 5 Aaron Bergman   Samsung Convoy 3                 2            13       752 Technology             9      1576          1759     1
 6 Aaron Bergman   Sauder Facets Collection…        2            13       752 Furniture             10      1529          1387     1
 7 Aaron Hawkins   "ACCOHIDE 3-Ring Binder,…        3            54       266 Office Su…            44        60            42     1
 8 Aaron Hawkins   Avery 51                         3            54       266 Office Su…            16      1081           649     1
 9 Aaron Hawkins   Avery Fluorescent Highli…        7            54       266 Office Su…            24       577           369     1
10 Aaron Hawkins   EcoTones Memo Sheets             2            54       266 Office Su…            15      1165           705     1
# ... with 9,931 more rows
				
			

The most important thing is that you have two sets of ranks: a source rank (customer) and a destination rank (product).

In order to make the Sankey we need to do some basic data modeling, so here’s my data model:

				
					data_frame(
  one = 1,
  t = c(seq(0,1, by = .01), seq(2,3))
) %>%
  write_csv("https://i0.wp.com/www.phdata.io/generic_data_model.csv")
				
			

The data created has two columns, the first column, labeled one, is a column of the integer 1. We’ll use it for joining to the other dataset. The other column, t, counts up from 0 to 1 by .01, and then has additional values of 2 and 3.

Let’s join the two datasets together.

When it comes to the data modeling, I like to draw lines using when t ranges from 0 to 1. I then like to add circles before and after indicating the intensity of the path. Use t = 2 and t = 3 for this. Specifically, t = 2 will be a circle for the customer, and t = 3 will be a circle for product. Let’s create a useful calculation that separates out the circles from the chords. Let’s call it [separator].

				
					// Separator
IF [T] <= 1
THEN "chord"
ELSEIF [T] = 2
THEN "start point"
ELSEIF [T] = 3
THEN "end point"
END
				
			

Just like the last blog post, we’re not actually going to use all the data, we’re going to create a boolean to return just the top ranked values. We’ll use a parameter to dynamically select those values. Our parameter is called [Select Rank], it’s an integer that run from 0 to 400. Then we have this calculation:

				
					[Prod Rank] <= [Rank Select]
AND [Cust Rank] <= [Rank Select]
				
			

We can put it on filters right away. Now let’s make our Circular Sankey Diagram.

First the setup. This formula will allow us to make the pretty curved lines everyone gets excited about.

				
					// t (logodds)
IF [T] > 0 AND [T] < 1
THEN LOG([T]/(1-[T]))
END
				
			

And then our basic functions one we’ll call [path] and the other [distance]. We’re going to come back to this path formula and make updates so this isn’t final.

				
					//path (to be updated later)
IF [T] <= 1
THEN (([Prod Rank] - [Cust Rank])*[T]) + [Cust Rank]
ELSEIF [T] = 2
THEN [Cust Rank]
ELSEIF [T] = 3
THEN [Prod Rank]
END/[Select Rank]
				
			
				
					//distance
IF [T] <= 1
THEN 2.5  -((1/(1 + exp([2a. t (logodds)])))) * 2.5
ELSEIF [T] = 2
THEN .23
ELSEIF [T] = 3
THEN 2.3
END
				
			

This produces the following sankey diagram. This is nice and all, but I want CIRCLES!

And here’s the thing: [path] is basically our angle. And [distance] is our radius. So we can throw these formulas into a basic formula for a circle

				
					//x
([distance] + 1) * COS([path] * 2 * pi())
				
			
				
					//y
([distance] + 1) * SIN([path] * 2 * pi())
				
			

Finally a [size] calculation for sizing the circles and the chords.

				
					// size
IF CONTAINS([1b. separator], "bar")
THEN 20
ELSEIF CONTAINS([1b. separator], "start point")
THEN {FIXED [Cust Rank]: SUM(IF [T] = .5 then [Cust Quantity] end)} * 3
ELSEIF CONTAINS([1b. separator], "end point")
THEN {FIXED [Prod Rank]: SUM(IF [T] = .5 THEN [Prod Quantity] end)} * 3
ELSEIF [1b. separator] = "chord"
THEN 100
ELSE 5
END
				
			

With this, its time to build!

Create a line chart. Add [customer name], [product name], [separator], [number of records] and [customer rank] all to detail as dimensions. Add [y] to rows and [x] to columns as continuous values add [t] to path and [size] to size.

Now let’s class it up. Let’s create a set action on customer rank. Create a set with customer rank, add it to color, make it an action. This will allow you to select a customer and see the products the customer has purchased. Its a nice touch, but if you are like me the one thing that’s bothering me is the “bite” in the chart and the fact that the path from customer to product isn’t taking the most efficient route.

This is because the path from customer to product isn’t the most efficent. When we make a regular sankey you have only a single path that you can follow. When we make a circular sankey you have two paths — clockwise or counter-clockwise. Our base calculations only take into account for a single path direction. To fix this is actually pretty easy.

First we need to identify the cases at issue. In some cases we have a path that goes counter-clockwise but it should be going clockwise via a shorter path. The second issue is the the opposite, your classic visa-versa!

So how do we idenifty these cases — its actually really easy. If ([cust rank] – [prod rank])/[rank select] is greater than .5 or ([cust rank] – [prod rank])/[rank select] is less than -.5 then the paths should be flipped.

Lets first indentify these cases with a [Rank Delta] calculation:

				
					//Rank Delta
(([Prod Rank]/[1a. Rank Select]) - ([Cust Rank]/[1a. Rank Select]))
				
			

Then we just update our path calc to tell us when they need to be flipped. And when they need to be flipped all we need to do is add or subtract the [Rank Select] values to the product ranks. This is the beauty of circle, to change directions we just need to add or subtract the overall total.

				
					//path
IF [T] <= 1
THEN IF [Rank > 50%] < -.5
    THEN ((([Prod Rank] + [1a. Rank Select]) - [Cust Rank])*[T]) + [Cust Rank]
    ELSEIF [Rank > 50%] > .5
    THEN ((([Prod Rank] - [1a. Rank Select]) - [Cust Rank])*[T]) + [Cust Rank]
    ELSE (([Prod Rank] - [Cust Rank])*[T]) + [Cust Rank]
    END
ELSEIF [T] = 2
THEN [Cust Rank]
ELSEIF [T] = 3
THEN [Prod Rank]
  END/[Select Rank Parameter]
				
			

Thats basically it. I’m just going to remove extra marks with one additional calculation and placing it on filters set to TRUE.

				
					//filter duplicated dots
IF ([T] = 2
AND [Prod Rank] = {FIXED [Cust Rank] : MIN([Prod Rank])})
THEN TRUE
ELSEIF (NOT [Cust Rank | Set] 
AND [T] = 3
AND [Cust Rank] != {FIXED [Prod Rank] : MIN([Cust Rank])})
THEN TRUE
ELSEIF [1b. separator] = "chord"
THEN TRUE
ELSEIF [1b. separator] = "end point" AND [Cust Rank | Set]
THEN TRUE
END
				
			

Data Coach is our premium analytics training program with one-on-one coaching from renowned experts.

Accelerate and automate your data projects with the phData Toolkit