In this blog, we’ll explore the clustering and search optimization features in the Snowflake Data Cloud and ultimately help you decide which one to select for a specific use case.
Snowflake has enabled many different use cases for consumers in Data Science, Analytics, Reporting, etc. They offer several features out of the box that helps customers reduce the amount of tuning required for optimal performance. Snowflake uses micro partitions to store data and the data within micro partitions are stored in columnar format.
Snowflake’s cloud services layer stores metadata information about these micro partitions in the metadata layer which will be used by queries in runtime. This includes information such as count, mix/max, and other information required for optimal query processing. Snowflake recently made the search optimization feature generally available which is designed to optimize point lookup queries.
Clustering Data in Snowflake
Snowflake usually loads data into micro partitions of a table in the order the data was loaded. This is called a naturally clustered table and it works very well for small to medium tables. However, there might be cases when the data needs to be reclustered on different keys when the query performance is not optimal. Snowflake recommends defining clustering keys only in the following scenarios:
- The query performance is not optimal and the data is queried consistently on a specific key.
- The table is of multi-terabyte size and a large amount of data/micro partitions are scanned.
What is Search Optimization in Snowflake?
Search optimization is a newer feature from Snowflake which improves the performance of Point lookup queries. This feature allows faster response for critical dashboard/reports which uses point lookup queries on high volume tables. Similar to clustering, the search optimization service runs in the background to collect metadata information about the columns and populates search access paths.
Key Differences Between Clustering and Search Optimization in Snowflake
Even though this looks similar to clustering, there are a few differences between clustering and search optimization:
- Clustering can be done on one key or a set of keys, whereas search optimization is enabled for all columns.
- Clustering adds additional compute cost to run background process for reclustering whereas search optimization adds both compute and storage cost. In some cases, the size of the search access path may be as big as the table itself, but in most cases, it will be one-third of the size.
- Clustering speeds up both range and equality searches but search optimization speeds up equality search only.
Clustering vs. Search Optimization Example
The following example shows how clustering and search optimization differ in query processing. Let’s create a table with 600M records from Snowflake sample data.
create table lineitem as select * from snowflake_sample_data.tpch_sf100.lineitem
Step 1: Clone the tables as below.
create table lineitem_clustered clone lineitem; create table lineitem_optimized clone lineitem;
Step 2: Enable clustering and search optimization on each of the tables.
alter table lineitem_clustered cluster by (l_shipdate); alter table lineitem_optimized add search optimization;
Note: Both clustering and search optimization are done by the Snowflake background process. So it will take a few minutes to a couple of hours for the process to complete.
Step 3: Verify clustering is complete.
select system$clustering_information('lineitem','(l_shipdate)'); select system$clustering_information('lineitem_clustered','l_shipdate');
Step 4: Verify search optimization is complete by checking the following fields in the result.
show tables like '%lineitem_optimized%';
Now let’s run point-lookup queries on each of these tables.
Step 5: Run select query on a table that has no clustering or optimization enabled.
The query profile below shows that the query profiler had to scan all the micro partitions to get the required result in six seconds.
select * from lineitem where l_orderkey='506754340';
Now let’s try the same query in the clustered table. This table is clustered on l_shipdate while we are querying on l_orderkey, so we can see that clustering does not affect this type of query.
select * from lineitem_clustered where l_orderkey='506754340';
Now when we rewrite the query to include cluster key as below, then the profiler took advantage of clustering and reduced the runtime by scanning on the required micro partitions.
select * from lineitem_clustered where l_shipdate between '1992-05-01' and '1992-07-15' and l_orderkey='506754340';
So even though clustering helps improve performance, it will not improve performance on point-lookup queries on all fields.
Finally now when we query the table with search optimization enabled, we can clearly see that the profiler is using search optimization effectively and returning results much faster.
select * from lineitem_optimized where l_orderkey='506754340';
Clustering is a great feature to cluster the micro partitions with one key or multiple keys. This feature helps reduce the number of micro partitions pruned during query processing and is generally more effective for analytical queries which continuously queries on a single or set of keys. However, this feature will not be helpful if the user is querying on a field, which is different from the field on which the table is clustered on.
Snowflake allows enabling both clustering and search optimization on a table. However, the decision on which one to choose or to choose both depends on the use case. While these features increase query performance, sufficient due diligence should be done to ensure the required feature needed for a specific use case. As both these services are background services, enabling these on high churn tables will significantly improve cost.