Why this Series?
Katrina and I were chatting a couple of weeks ago. Though she is a skilled Tableau Developer and Snowflake user, she mentioned that she couldn’t find many resources on optimizing Tableau performance when Snowflake is the underlying source. When should you use a live connection versus an extract? How do you maximize performance once you decide on one over the other?
I didn’t know of any explicit resources either. However, I have filled this role for numerous clients, providing recommendations and training on optimizing performance from Tableau to Snowflake across multiple projects. We decided that consolidating the information in one place, where it could help as many people as possible, was a good idea.
To start, we will talk about extracts and live connections when connecting Snowflake to Tableau.
Extract Versus Live Connections
Is a data extract better than a live connection for performance? Unfortunately, there is no straightforward answer; there is a lot of ambiguity here. Analytics centers of excellence often struggle to answer this question. This blog series gives insight into when one is better than the other.
In general, extracts (such as Tableau hyper files or Power BI’s Import mode) will perform better with small to medium-sized data sets. This is because:
- Extracts are optimized for storage and performance.
- A live connection is always constrained by network latency: the upload and download speeds between the client and the database.
This naturally raises the question: “What is a small or medium-sized data set anyway? How can you quantify it?”
Quantifying it is hard because it depends on many factors. Generally, when I say “medium,” I mean hundreds of thousands of rows to low millions, with a reasonable number of columns (two dozen or fewer). However, there are always exceptions to this rule. Database and dashboard design generally have a much more significant impact on performance than the number of rows and columns.
Eventually, as the data set grows, there is a “point of no return.” Underlying databases are optimized for processing large queries against the largest data sets, whereas our desktops are not. Crashing Tableau Desktop or “not enough RAM” errors are pretty good indicators that an extract is not well suited for the scenario. Even Tableau Server, which can process much more than a local machine, pales in comparison to a dedicated analytics server.
Essentially, there is a point where the processing power of the database outweighs the network latency, and it becomes better to switch back to a live connection.
Within all this ambiguity of data sizes, databases, and dashboard design, there are several general rules we can follow to ensure we have the optimal connection. For this series, we will focus on cloud-based databases, using Snowflake as the example use case.
General Rule #1:
Extracts are better than live connections when the data is small to medium unless you’re using a database specifically optimized for analytic queries.
Since Snowflake was built for data analytics, it will perform comparatively better as the data set grows. Where a live connection to Cloudera Hadoop may have been impossible, it may be perfectly reasonable with Snowflake. There are a few core reasons why we select Snowflake for analytics.
First, Snowflake’s architecture avoids resource contention well. Because compute is separated from storage, multiple people can send queries against the same data simultaneously, so things don’t get stuck in a queue as often.
Second, cloud systems like Snowflake are built to scale up and out. Scaling up means a larger warehouse can return an individual query’s results more quickly. Scaling out means adding more clusters, so the system can handle more concurrent queries from multiple people.
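To make the two directions concrete, here is a minimal sketch using the snowflake-connector-python package. The credentials and the warehouse name tableau_wh are placeholders, and multi-cluster (scale-out) warehouses require Snowflake’s Enterprise edition or above.

```python
import snowflake.connector

# Placeholder credentials; in practice, pull these from a secrets manager.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
)
cur = conn.cursor()

# Scale UP: a larger warehouse returns an individual query's results faster.
cur.execute("ALTER WAREHOUSE tableau_wh SET WAREHOUSE_SIZE = 'LARGE'")

# Scale OUT: more clusters serve more concurrent queries from more people.
# (Multi-cluster warehouses require Enterprise edition or above.)
cur.execute(
    "ALTER WAREHOUSE tableau_wh SET MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4"
)

cur.close()
conn.close()
```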
Third, Snowflake has a solid caching system. Caching means the database remembers the results of common and/or recent queries; if someone requests those same results again, it returns them nearly instantaneously.
This caching system is helpful with Tableau. The initial queries a Tableau Server dashboard fires are rarely unique. The dashboard might be slow for the first person or two who logs in for the day, but it should become nearly instantaneous for subsequent people running the same queries that day.
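As a rough illustration, the sketch below runs the same query twice through snowflake-connector-python. Assuming the underlying data has not changed, the second run is typically served from Snowflake’s result cache. The credentials, warehouse, and sales.orders table are placeholders.

```python
import time
import snowflake.connector

# Placeholder connection details.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="tableau_wh",
)
cur = conn.cursor()

# A stand-in for the kind of query a Tableau dashboard fires on load.
query = "SELECT region, SUM(amount) FROM sales.orders GROUP BY region"

for label in ("first run (hits the warehouse)", "second run (result cache)"):
    start = time.time()
    cur.execute(query)
    cur.fetchall()
    print(f"{label}: {time.time() - start:.2f}s")

# To benchmark the warehouse itself, the cache can be disabled per session:
# cur.execute("ALTER SESSION SET USE_CACHED_RESULT = FALSE")
```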
These performance features can significantly help with data sets of all sizes.
General Rule #2:
Snowflake’s functionality, such as scaling and caching, can significantly help with the performance of queries against data sets of all sizes.
The final thing to address is what extracts are. Extracts are data snapshots: at a given point in time, a copy of the data is created and stored on Tableau Server. You need to consider how extracts work when deciding whether to use one.
First, creating an extract takes processing power. In an oversimplified sense, the server machine’s CPU and RAM do a lot of work to build these extracts, and that usage can cause contention for resources.
How much contention varies with the Tableau Server environment’s configuration, since you can essentially dedicate resources (such as backgrounder processes) to creating extracts. At a minimum, the contention from building one extract makes it challenging to generate a second extract simultaneously. At worst, it can slow down Tableau Server overall.
The second thing to address is that extracts take up storage space. Tableau Server, like a personal computer, has a limited amount of storage. Although it generally has much more space, extracts can sometimes be hundreds of gigabytes in size and can consume all available storage relatively quickly if you aren’t monitoring it.
Third, since data extracts are snapshots, they must be refreshed on a schedule. The refresh itself is an intensive process that takes resources away from the server, so it is important to distribute the size and number of extract refreshes running at any given time.
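One way to distribute that load is to trigger curated refreshes serially during off-hours instead of letting them pile onto the same schedule. Here is a rough sketch using the tableauserverclient library; the server URL, access token, and data source names are placeholders, and Tableau Server’s built-in schedules can achieve the same staggering.

```python
import tableauserverclient as TSC

# Placeholder server and personal access token.
auth = TSC.PersonalAccessTokenAuth("token-name", "token-secret", site_id="")
server = TSC.Server("https://tableau.example.com", use_server_version=True)

# Hypothetical curated extracts to refresh one at a time, not concurrently.
targets = {"Enterprise Sales Extract", "Finance Extract"}

with server.auth.sign_in(auth):
    datasources, _ = server.datasources.get()
    for ds in datasources:
        if ds.name in targets:
            job = server.datasources.refresh(ds)  # queues an async refresh
            server.jobs.wait_for_job(job)         # finish before starting the next
```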
Because of these resource considerations, extracts are at their best when they are curated. An extract should support multiple people or departments. For example, a centralized Enterprise Sales Extract that various users and departments can access is a better use of resources than many one-off extracts.
General Rule #3:
Limit the number of extracts used on Tableau Server, as they are resource-intensive. Wherever possible, make sure extracts serve multiple dashboards.
It is also worth noting that, in some cases, extracts can subvert security.
Most modern databases offer row-level security, which means specific individuals can only access certain rows within a dataset based on their assigned roles. For example, if I am a sales leader, I may only see records for my sales region, but not for my peers’ regions.
Security can also apply to individual tables or databases themselves.
If I make an extract and share it with my peers, they see everything I see. An extract can therefore circumvent security protocols put in place by the database administrators.
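To make this concrete, here is roughly how such a rule can be enforced inside Snowflake with a row access policy, which a live connection from Tableau then honors automatically. The sales.orders table, its region column, and the security.region_map mapping table are hypothetical, and row access policies require Snowflake’s Enterprise edition or above.

```python
import snowflake.connector

# Placeholder administrator credentials.
conn = snowflake.connector.connect(
    account="my_account",
    user="admin_user",
    password="my_password",
)
cur = conn.cursor()

# The policy returns TRUE only for rows the current role may see.
# security.region_map is a hypothetical table mapping roles to regions.
cur.execute("""
    CREATE OR REPLACE ROW ACCESS POLICY sales_region_rls
      AS (region STRING) RETURNS BOOLEAN ->
        CURRENT_ROLE() = 'SALES_ADMIN'
        OR EXISTS (
            SELECT 1 FROM security.region_map m
            WHERE m.role_name = CURRENT_ROLE()
              AND m.allowed_region = region
        )
""")

# Attach the policy; every query is now filtered by the caller's role.
cur.execute(
    "ALTER TABLE sales.orders ADD ROW ACCESS POLICY sales_region_rls ON (region)"
)
```

An extract, by contrast, is built under the credentials of whoever creates it, so the snapshot contains that person’s view of the data.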
Yes, you can implement security within Tableau to mirror the permissions outlined above. However, Tableau was not built with security as its primary focus, and it is generally much harder to implement these business rules there. It is almost always better to enforce them at the database level and require a live connection for protected data.
General Rule #4:
Protected data is best used with a live connection and not a data extract, as database permissions are best implemented at the database level and not the Tableau level.
So, should you use an extract or a live connection? As we discussed, there isn’t a clear answer; sometimes one will be better than the other. However, when using Snowflake, the potential advantages of a live connection are numerous, and extracts carry real resource costs. Trying a live connection first is generally the best idea.
General Rule #5:
Use a live connection wherever possible, especially to Snowflake. Extracts should be the fallback for fringe scenarios.