Combining Alteryx and the Snowflake Data Cloud is a match made in data heaven. Alteryx provides an intuitive, low-code platform for data integration and transformation, while Snowflake delivers a powerful processing engine and storage layer for your consumer applications.
However, while Alteryx has a fixed cost, Snowflake operates on a pay-as-you-go model. To maximize the benefits of this dynamic duo, it’s essential to understand how Snowflake bills you for its services.
Snowflake charges you in three primary ways: Compute, Storage, and Data Transfer.
Fortunately, Alteryx users don’t need to worry about Data Transfer costs, since Alteryx interacts with Snowflake through an ODBC driver, which carries no transfer charges. Keeping an eye on Compute and Storage, however, is crucial for controlling and optimizing your budget.
In this blog post, we’ll share five essential tips that will enable you to unlock the full potential of this dynamic duo in the most cost-effective way possible.
Tip #1: Set an Appropriate Virtual Warehouse for your Snowflake Connection in Alteryx
Connecting to Snowflake through Alteryx is easy, but to optimize your usage, it’s essential to set up your virtual warehouse properly. After installing the Snowflake ODBC driver, you can either create a DSN connection or a DSN-less connection string.
Regardless of the method, you can set several parameters, many of them optional, including the warehouse used to process data in Snowflake. If you don’t define a virtual warehouse, Alteryx will fall back to your user’s default warehouse, which may not be the most efficient option.
For instance, your default Medium warehouse consumes four times the credits of an X-Small one while often speeding up your queries by less than 2x, especially if you work with relatively small datasets and simple transformations. This scenario is not ideal: you end up paying more than necessary to process your data.
Therefore, it’s always good practice to evaluate your options and compare the credits each warehouse size actually burns before committing. By doing so, you can pick the virtual warehouse that best fits your needs, saves costs, and keeps query processing time acceptable.
To avoid this scenario, it’s advisable to create a separate virtual warehouse in Snowflake exclusively for Alteryx. You can then apply a specific configuration that fits your use cases, including the appropriate size and parameters. This approach gives you more control over the compute resources you’re using and helps to optimize your budget.
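As a sketch, a dedicated warehouse for Alteryx might look like the following (the warehouse name, role name, and sizing are assumptions to adapt to your workloads):

```sql
-- Hypothetical dedicated warehouse for Alteryx workloads
CREATE WAREHOUSE IF NOT EXISTS ALTERYX_VW
  WAREHOUSE_SIZE = 'XSMALL'      -- start small; resize only if queries demand it
  AUTO_SUSPEND = 60              -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE             -- wake up automatically when Alteryx sends a query
  INITIALLY_SUSPENDED = TRUE;

-- Let the role Alteryx connects with use it (role name is an assumption)
GRANT USAGE ON WAREHOUSE ALTERYX_VW TO ROLE ALTERYX_ROLE;
```

Starting at X-Small and resizing upward only when measurements justify it is usually cheaper than sizing down from a large default.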
To set up a warehouse in a Snowflake DSN connection, just set the “Warehouse” parameter in the ODBC driver’s configuration dialog.
For a DSN-less connection string, use this example:
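A minimal sketch, assuming the Snowflake ODBC driver’s default name and placeholder account values (the `__EncPwd1__` token is where Alteryx stores the encrypted password):

```
odbc:DRIVER={SnowflakeDSIIDriver};SERVER=<account>.snowflakecomputing.com;UID=<user>;PWD=__EncPwd1__;DATABASE=<database>;SCHEMA=<schema>;WAREHOUSE=ALTERYX_VW
```

Replace the angle-bracketed placeholders with your own account details; the `WAREHOUSE` parameter is what points the connection at your dedicated Alteryx warehouse.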
* Your role must have the USAGE privilege on the warehouse provided; otherwise, you will run into an error when using the connection in Alteryx.
Tip #2: Create Transient Tables with No Retention Period when Outputting Data to Snowflake
When optimizing costs for Alteryx users on Snowflake, it is crucial to consider the persistence and retention periods of the tables created along the way. By default, Alteryx generates permanent tables when creating or overwriting tables in Snowflake. These tables have built-in backup features such as Time Travel and Fail-safe, but whether you need them depends on the use case.
For instance, data ingestion processes such as those run by Alteryx workflows can typically recreate their data every time they are triggered, rendering extensive backup features unnecessary. Across multiple users and workflows, paying Fail-safe storage costs for data that may never be restored can significantly inflate the overall bill.
Snowflake applies different retention periods to different table types. If backup features are not required, transient tables are recommended: they have no Fail-safe period, and they can be created with a Time Travel retention of 0 days, resulting in no backup costs at all.
Two methods can be employed to create transient tables in Alteryx:
1. Create a transient database or schema via Snowflake. This automatically makes all tables created under this structure transient.
2. Use the Pre-SQL option in the Output Data tool to run DDL commands. This one is a bit tricky because you need to know your table’s metadata to pass it to the CREATE TABLE command.
Note that this approach works with the “Delete Data & Append” output option: the table is created as a transient table, and on each run all data is deleted and replaced by the data coming in through the Output Data tool.
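As a sketch of both methods (database, schema, table, and column names are placeholders):

```sql
-- Method 1: a transient schema; every table created under it is transient by default
CREATE TRANSIENT SCHEMA ANALYTICS.STAGING
  DATA_RETENTION_TIME_IN_DAYS = 0;  -- no Time Travel, and transient objects have no Fail-safe

-- Method 2: Pre-SQL in the Output Data tool, paired with "Delete Data & Append".
-- You must spell out the table's columns yourself.
CREATE TRANSIENT TABLE IF NOT EXISTS ANALYTICS.STAGING.ORDERS (
    ORDER_ID   NUMBER,
    ORDER_DATE DATE,
    AMOUNT     NUMBER(10, 2)
)
DATA_RETENTION_TIME_IN_DAYS = 0;
```

With retention set to 0, dropped or overwritten data is gone immediately, so reserve this pattern for tables your workflows can fully recreate.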
Tip #3: Utilize Resource Monitors to Effectively Manage Your Alteryx Dedicated Warehouse Credits
Resource monitors are a valuable feature available in Snowflake that enables users to efficiently manage the allocation and usage of resources within their Snowflake account. This feature can be particularly helpful for tracking warehouse usage and enforcing usage limits.
Once you have implemented Tip #1, the best way to ensure that you are staying within budget with your Alteryx + Snowflake utilization is by utilizing resource monitors.
By creating resource monitors in Snowflake and assigning them to specific warehouses, users can effectively control the level of resources that Alteryx workflows are utilizing over a specific period of time. This can be achieved through creating notification alerts, or by suspending the warehouses if a specific credit quota percentage is reached.
To create a resource monitor in Snowflake, you need the ACCOUNTADMIN role and a few parameters, such as the credit quota, the monitoring frequency, and the trigger thresholds.
After creating the resource monitor, it can be assigned to your dedicated Alteryx warehouse.
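Assuming the dedicated Alteryx warehouse is named ALTERYX_VW, a sketch might look like this (the quota and monitor name are illustrative):

```sql
-- Run as ACCOUNTADMIN; quota and thresholds are illustrative
CREATE RESOURCE MONITOR ALTERYX_MONITOR
  WITH CREDIT_QUOTA = 100          -- monthly credit budget for Alteryx workloads
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS
    ON 75 PERCENT DO NOTIFY              -- warn account admins
    ON 100 PERCENT DO SUSPEND            -- suspend after running queries finish
    ON 110 PERCENT DO SUSPEND_IMMEDIATE; -- suspend and cancel running queries

-- Attach it to the dedicated Alteryx warehouse
ALTER WAREHOUSE ALTERYX_VW SET RESOURCE_MONITOR = ALTERYX_MONITOR;
```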
The resource monitor defines the actions Snowflake takes when the ALTERYX_VW warehouse hits a percentage of the credit quota. For instance, at 75 percent, a notification is sent to account administrators who have notifications enabled. At 100 percent, the warehouse is suspended once running queries finish. At 110 percent, the warehouse is suspended immediately, cancelling any ongoing queries.
By using the “SHOW RESOURCE MONITORS” command, you can track the number of credits used since monitoring began. The command can be run by any role with the “MONITOR” privilege assigned.
There are additional parameters that can be set to a resource monitor, and the example provided in this article is only one possibility. For further information, it is recommended to refer to Snowflake’s official documentation.
Tip #4: Utilize Multiple Caching Features in Snowflake and Alteryx
Caching can not only enhance performance but also reduce compute costs when used efficiently in Alteryx. Snowflake offers three types of caching:
- Metadata Cache: Snowflake automatically collects and manages metadata about tables and micro-partitions, including the following:
- Row Count
- Min/Max Values
- Number of distinct values
- Null Count
To retrieve such statistics, instead of running a query like “SELECT * FROM TableA” and aggregating with a Summarize tool, you can run “SELECT MIN(Col1), MAX(Col2), COUNT(*) FROM TableA” directly.
This query does not consume warehouse resources, as it is answered from the metadata cache. Note that the available statistics also depend on the data types: for example, INT columns have MIN/MAX stored in the metadata cache, but FLOAT columns don’t.
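For example, in the Input Data tool you could swap the full-table pull for the metadata-only aggregate (table and column names are placeholders):

```sql
-- Pulls every row, then aggregates in Alteryx: spins up and bills a warehouse
SELECT * FROM TableA;

-- Answered from the metadata cache: no running warehouse required
SELECT MIN(Col1), MAX(Col2), COUNT(*) FROM TableA;
```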
- Results Cache: Snowflake stores the result of every query it executes. When the same query is executed again within 24 hours, Snowflake retrieves the result from the query result cache instead of the storage layer, avoiding warehouse resource usage entirely. This type of cache comes with some constraints that need to be respected:
- The query can’t contain UDFs or functions evaluated at execution time, such as CURRENT_DATE.
- The underlying data must not have changed since the last execution.
- The query syntax must be exactly the same.
This type of cache is particularly useful if your team frequently queries the same dataset and that dataset doesn’t change often. By standardizing the queries, you can avoid having every team member write their own variant with slightly different syntax, or use different methods and tools to process the data in Alteryx. This helps your team hit the Results Cache and reduce Compute costs.
- Virtual Warehouse Cache: If your workload is heavy and regular, there are cases where it’s better to keep a warehouse active than suspended. An active warehouse caches raw data on its own local disk, so it can retrieve that data without going all the way back to Snowflake’s storage layer. Lower processing time means you can run more queries in less time. If you know there’s a window in your day when Alteryx will fire multiple queries at Snowflake (you can analyze schedule times with Alteryx Server, for example), you can keep your warehouse active for that period and suspend it after the last job in the window finishes.
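One way to sketch this, assuming the dedicated warehouse from Tip #1 is named ALTERYX_VW, is to resume it and widen its idle timeout just before the busy window:

```sql
-- Before the busy window: wake the warehouse and let it stay warm between jobs
ALTER WAREHOUSE ALTERYX_VW RESUME IF SUSPENDED;
ALTER WAREHOUSE ALTERYX_VW SET AUTO_SUSPEND = 3600;  -- idle seconds before auto-suspend
```

The 3600-second timeout is an assumption; tune it so the warehouse cache survives the gaps between your scheduled jobs without idling long after the window ends.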
The virtual warehouse can be suspended manually using a Post-SQL statement:
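For example, assuming the dedicated warehouse is named ALTERYX_VW:

```sql
-- Post-SQL on the last job of the window: suspend the warehouse right away
ALTER WAREHOUSE ALTERYX_VW SUSPEND;
```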
- The role defined in Alteryx needs the OPERATE privilege on the selected warehouse to successfully run the statement.
Alteryx also has an interesting caching feature of its own, especially useful during workflow development when you know the Snowflake data won’t change. If you right-click the Input Data tool, you will see the option “Cache and Run Workflow”.
Once you select it, the data from Snowflake is saved to a temporary file on your local computer, and on subsequent runs Alteryx uses that file instead of querying Snowflake. This naturally reduces your Compute costs: after the first run, Alteryx loads the local copy of the data.
Tip #5: Optimize your Snowflake Warehouse for Bulk Loading Data in Alteryx
When it comes to bulk loading data into Snowflake, there are different considerations to keep in mind compared to querying data. The process of bulk loading data involves two steps: first, Alteryx splits the dataset into file chunks and uses the PUT command to load the data into a Snowflake stage, with Alteryx’s engine handling all the compute demands.
Second, Snowflake takes these files and loads them into a table using the COPY INTO command, which is executed on a virtual warehouse.
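Under the hood, the two steps are roughly equivalent to the following (stage, table, and file names are illustrative, and Alteryx manages them for you):

```sql
-- Step 1 (Alteryx engine): upload compressed file chunks to the table's stage
PUT file:///tmp/orders_chunk_*.csv.gz @%ORDERS;

-- Step 2 (virtual warehouse): load the staged files into the table in parallel
COPY INTO ORDERS
  FROM @%ORDERS
  FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"');
```

Only the COPY INTO step consumes warehouse credits, which is why warehouse sizing and file chunking are the two levers worth tuning.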
In many scenarios, you won’t need to scale up your warehouse to load data into Snowflake. While adding more compute size might seem like a logical solution to improve performance, Snowflake’s compute parallelism is optimized to handle loading multiple files at once, so the gains from scaling up the warehouse might not be worth the credits spent. An X-Small warehouse is often sufficient for datasets ranging from 1-100 GB in size.
Another useful tip is to test different file chunk sizes. While Snowflake typically recommends splitting files into chunks of 100-250 MB after compression, it may be worth experimenting with smaller chunk sizes in the 10-100 MB range for small to medium-sized datasets. Alteryx defaults to a chunk size of 64 MB, but fine-tuning this value may yield better performance than simply increasing the warehouse size.
If you’re bulk loading data into Snowflake from Alteryx, it’s always worth creating a separate connection and warehouse specifically for this purpose. A smaller warehouse may be sufficient for your needs, particularly if you’re outputting aggregated data optimized in your workflow for analytics purposes. By keeping these tips in mind, you can optimize your Snowflake warehouse and achieve good performance at a lower cost for bulk loading data.
In conclusion, optimizing your Snowflake costs when using Alteryx is essential to maximize your return on investment and gain a competitive edge in the marketplace. By following the five essential tips outlined in this post, you can ensure that your data integration and management processes are efficient and cost-effective.
At phData, we specialize in helping companies optimize their data strategy by leveraging the power of platforms such as Alteryx and Snowflake. Our team of experts can work with you to develop a tailored approach that meets your unique needs and delivers maximum ROI. Contact us today to learn more about how we can help you achieve your data management goals.