Why this Series?
Katrina and I were chatting a couple of weeks ago. Though she is a skilled Tableau Developer and Snowflake user, she mentioned that she couldn’t find many details on optimizing performance on Tableau when using Snowflake as the underlying source. When should you have a live connection versus extract? How do you maximize performance once you decide on one over the other?
I didn’t know of any explicit resources either. However, I have had to fill this role for numerous clients. I have provided recommendations and training on optimizing performance from Tableau to Snowflake on multiple projects. We decided consolidating the information in one location to help ask as many people as possible was a good idea.
Today, we will talk about optimizing performance when connecting Snowflake to Tableau.
In the previous topic, we talked about some general rules regarding when to use an extract versus a live connection. However, just because we select one of the others doesn’t mean that our queries in Tableau will perform well. If that is the case, how do you optimize performance?
First, focus on what you can control. A dashboard designer is going to be able to optimize performance through good dashboard design. They may not be able to optimize the underlying data source without involving the data engineering department.
Second, Tableau has its limitations. Two related items come to mind here. First, the more marks you have in your individual visualizations, the slower they will be. A mark could be a dot on a scatter plot, a bar on a bar graph, or a cell in a crosstab table. A visualization with one thousand marks will be slower than a visualization with one hundred.
I have seen, numerous times, individuals trying to get a visualization with hundreds of thousands or millions of marks, often rows in a cross table, trying to render. Tableau’s purpose isn’t showing very granular use cases. There is almost no “fix” for performance when rendering this many marks. This performance is a design issue. Tableau is designed to show aggregate data in nice visualizations. It fails when we try to use it to generate detailed crosstabs. It is not its purpose. Please stop trying.
Sizing a Snowflake Warehouse
As we mentioned in the previous topic, you can scale up Snowflake to get better performance. So what does this exactly mean?
Snowflake calls the power behind the scenes that run queries a warehouse. This term can be a little confusing as a data warehouse generally means the entirety of the technology running and storing the data. However, in Snowflake, it refers to the processing power of an individual machine running queries.
Snowflake uses “t-shirt sizes” to reference the relative power of the warehouse processing queries — Extra Small (XS) to Small (S) to Medium (M), and so on.
There is roughly a 2x relationship when going up in sizes. If it took 60 seconds to run a query on a Small (S) warehouse, it is reasonable to assume that a Medium (M) warehouse would perform in 30 seconds and Large (L) would be 15 seconds.
This rule isn’t perfect, but it is a general rule that can help us size.
But let’s say we have a Large (L) warehouse that performs in 1 second. If we upgrade to Extra-Large (XL), we would expect .5 seconds, and then XXL would be .25 seconds.
This scaling is essential to keep in mind as it eventually reaches an inflection point. The relative increase in performance is getting more challenging for a person to differentiate. Do I care (or even notice) the difference between 1 second and a half-second? Probably not.
But you might be wondering, well, what’s the harm in sizing a database to XXL. I might have one query that processes 100 records and one query that processes a billion. Sure, I need the XXL for the billion-row query, but what’s the harm of putting the 100 record query on this warehouse too. Every query could potentially run super quick, and we would need only one warehouse.
Well, like a lot of things in the world, the issue is money. As we increase the size of the data warehouse, we are paying more. We would be paying too much to run that 100-row query on the XXL warehouse. You are paying extra for minimal improvement. You likely wouldn’t notice a difference between an XS and XXL warehouse when processing just 100 rows of data.
It is super important to “right-size” your queries on Snowflake. It will save your organization money in the long run.
How to “Right-Size”?
First, you need to determine what is acceptable. With a query that runs once a month by a handful of people, performance of 30 seconds might be okay. Yes, they have to wait, but not that many people are waiting, and they aren’t waiting that often.
Now, if many people use the same dashboard dozens of times a day to help make decisions, 30 seconds every time is not acceptable. The group will be very frustrated.
Setting the service-level agreement (SLA) for performance-based off on audience size and usage patterns is essential. This SLA is going to help influence optimal warehouse size.
Once you have set these SLAs and have communicated the expectations to your user base, right-sizing the warehouse is relatively simple.
First, make sure you have optimized the data source and dashboard design as much as possible first. We will have more on that in other posts. Making sure that these two items are optimized first before increasing the warehouse-size is essentially a courtesy. The structure of your data and the design of your dashboard will have significantly more impact on performance than the power of the Snowflake warehouse itself. Remember that your organization pays more for bigger warehouses. Be sure to optimize performance with the things you can directly control before asking IT for a bigger warehouse.
Once we have confirmed that the data source and dashboard design are as good as possible, we can start looking at the warehouse size. Run the warehouse on an extra-small (or the smallest size you have available). If the performance doesn’t meet your SLA expectations, increase the size of the warehouse. Stop once you get to the performance that you need. Don’t go any bigger.
Note that you will likely need to work with a Snowflake IT admin as most users don’t have permission to change the sizes of warehouses themselves.
Be sure to test performance when the Snowflake isn’t caching.
One more thing about warehouse sizing: the sizes also generally are different depending on the environment. A test environment or development environment usually doesn’t need to be as quick as a production environment. The size of the test environment warehouse probably doesn’t need to be as big as the production environment.
It's Not Just About the Number of Rows Returned
Okay, so you have a query that always returns 100 rows or less. That should always perform great, right? Well, unfortunately, no.
This idea is a common misconception I see when I help people optimize the performance of their Tableau dashboards. “My visualization view only has ten marks. Why does it take so long to render?”
Well, the answer is tied to “How many rows of data are you processing/iterating through to get those ten marks?” Let’s take a look at three scenarios where this can come into play.
Scenario #1: Tableau Data Source
Remember that Tableau likes to aggregate data. The number of records Tableau shows in a visualization is equivalent to the number of unique combined values of the dimensions that are on columns and rows. Your data source could return values within a Table like the one below:
That doesn’t mean, however, that your underlying data set only contains three records. It means that it includes at least three records. It could have 100, 100,000, 1 million, or 1 billion records. Tableau aggregates.
Now the performance processing time is going to be, in part, due to how many records Tableau (and the underlying database like Snowflake) needs to process. 1 billion records is going to be slower than 1 million records.
Using joins in the Tableau data source compounds this potential scenario, mainly if it is a many-to-many join relationship. That can arbitrarily create a data set that is exponentially bigger than expected.
Now, I should say restricting the amount of data returned by having it be highly aggregated is a good idea, no matter the size of the database. We will talk more about that in a later topic. We are simply noting that query performance is proportional to the size of the underlying data set.
Scenario #2: Using a Custom Query
Most analytics tools, Tableau included, support the use of custom queries. A custom query is where the user writes the SQL query that’s sent directly to Snowflake. Is this a good idea for performance? Almost always, the answer is no.
If you are an amateur SQL writer, please assume that Tableau is probably better at writing SQL than you are. Sometimes, Tableau does write “bad” unoptimized SQL. But most of the time, the SQL is perfect.
The second thing is custom queries, particularly ones that contain join logic, circumvent fundamental concepts. Join culling and relationships. Both of these concepts, though different in practice, do the same thing. They optimize query performance by only grabbing details from the tables needed for that visualization and ignore the tables that are not required.
These concepts are helpful for performance. Joins, when added in a custom query, are always executed on every visualization. Whether or not the data from the second, third, or fourth table is needed to calculate the result. Joins often increase the number of rows returned or processed in the returning set. Custom queries that have joins will always process and return these records. If we build the data set in Tableau, instead of using a custom query, Tableau can take advantage of either join culling or relationships. Tableau can then optimize performance by minimizing the number of rows that are processed.
Scenario #3: Views in Snowflake
A view is essentially a curated custom table. We create views within our database, like on Snowflake, to have a custom version of our data. A view might join multiple tables or aggregate the results to a certain level. We create views when it doesn’t make sense to create a net-new table, as the information already exists somewhere, but we need to combine or manipulate the data in a certain way.
In general, with databases, there are two types of views: a materialized view and a standard view. A materialized view creates a copy of the data and stores it within the database. In Snowflake, it will also recreate that datastore every time the underlying table(s) is updated.
A standard view doesn’t store the data; it just keeps the logic to create the view. So every time someone accesses the view, it executes the logic to return the results.
Like with the two previous examples, the number of rows returned is not usually the contributing factor to bad performance. Performance degradation is proportional to the number of rows that need to be processed to generate the results. You wouldn’t have this processing with a materialized view.
The main difference is that a materialized view, though performs better, is that the data updates in the view whenever the underlying data in the table is updated. This takes additional resources to keep the view up to date. A standard view, though may perform more slowly, does not need to be kept up to date. It stores the logic and will bring back the most recent data from the tables.
It is not generally advisable to use materialized views in Snowflake as a default. A Snowflake warehouse must be running to keep the view up to date. This circumvents the main advantage for Snowflake, which is that you only pay for what you use. If we keep a warehouse on to update materialized views, we could be paying more than we need to.
Here are some general use as to when to use one type of view over another in Snowflake.
Materialized views should be used when:
- Underlying data doesn’t change often, so a warehouse to update the view doesn’t need to run often.
- The view itself is used regularly, as this justifies creating one for its performance advantage at the cost of regularly running the warehouse.
- The underlying query to create the view is complicated.
Standard views should be used when:
- Underlying data changes often.
- The underlying query to create the view is uncomplicated.
- The underlying query to create the view is used infrequently.
Generally speaking, try to use a standard view if you can. If you run into performance issues, consider moving to a materialized view. If you do run into performance issues, switching to a materialized view can be a good option. However, remember that the materialized view is more expensive from a warehouse perspective, as whenever the underlying data is updated in the table, we need to have a warehouse update the view as well.
Materialized views are also only available as a premium paid feature in Snowflake. It is not an option for many organizations. If your performance of a standard view is slow, and creating a materialized view is not an option, you can always create a new table instead. A table is always materialized.
Performance Optimization in Tableau
Though we are talking about it last, optimizing performance within Tableau is probably the most important and impactful way to increase performance when connecting to Snowflake or any data warehouse system. We have already talked about what is specific to Snowflake. Let’s quickly cover best practices independent of the underlying systems.
Item 1: Number of Marks
We started talking about performance optimization by talking about the number of marks in the table. Still, it is worth mentioning and stressing again, especially now that we understand more about what is happening behind the scenes — the number of marks that come into a visualization limits Tableau’s performance.
As a reminder, a mark is anything that represents a specific value within our visualization. Again, this could be a dot on a scatter plot, a bar on a bar chart, or a cell in a crosstab table.
As a general guideline queries on visualizations that show more marks take longer to run the queries that return fewer marks. This relationship is pretty evident if we think about it: the more data that has to be returned to Tableau, the more time that will likely take.
However, that isn’t the most significant issue here. Tableau is a visualization tool that has to render visualizations. Creating and displaying nice visualizations is why we are using the tool. Generating visuals happen in Tableau proper and has nothing to do with the underlying database.
The more marks we have, the longer it will take Tableau to create that visualization. Tableau Desktop or Tableau Server will have to use its machine’s resources, mostly RAM and your graphics card, to generate these visualizations. The more marks we have in a visualization, the longer it will take to render.
If this is the leading cause of subpar performance, there is nothing you can do except redesign your dashboard. There is nothing you can do within Snowflake to fix this issue. Creating visualizations with hundreds of thousands or millions of marks, such as within a large table with a ton of rows and columns, is not Tableau’s strong suit. Showing aggregate information within an interactive context is Tableau’s purpose.
Item #2: Calculated Fields
The second item for Tableau performance that I always look at is calculated fields. Specifically, nested calculations.
A nested calculation is when we execute the same type of operation twice within the same calculated field. For example, if we have a level-of-detail (LOD) calculations within another LOD expression. Or if we have nested if-else or case conditional statements.
IF SUM([Sales]) > 10000 AND (IF SUM([Profit]) > 5000 THEN TRUE ELSE FALSE END) THEN 'GREEN' ELSE 'RED' END
The calculated conditional field above contains a nested condition. Essentially, in if-else within another if-else. We could easily rewrite this condition to be the following:
IF SUM([Sales]) > 10000 AND SUM([Profit]) > 5000 THEN 'GREEN' ELSE 'RED' END
You would get the same results, but you don’t have the nested calculation. Now, this is an oversimplified example. Most people probably wouldn’t write the calculations the first way and instinctively write it the second way. However, for more complicated calculations, this principle may not be as intuitive.