March 1, 2022

How to Choose the Right Backup and Disaster Recovery Plan for Snowflake

By Meesum Ali

For most organizations in 2022, databases are crucial to day-to-day operations: they store information on product inventory, customer purchases, and employee details, among other critical records.

Without this information, businesses may not know important details such as a customer’s account balance, purchase history, or login credentials, which can bring day-to-day operations to a halt.

That makes it all the more important for every organization to have a backup and disaster recovery plan in case its databases go down.

The Snowflake Data Cloud offers several features that address disaster recovery:

  • Time Travel
  • Fail-Safe
  • Data Replication and Failover

Each of these features tackles disaster recovery with a slightly different approach. In this blog, we’ll cover each of them in detail.

Disaster Recovery Metrics

When it comes to something as critical as disaster recovery, it can be helpful to have some terms by which you can quantify the acceptable level of downtime and loss for various applications. Two metrics that allow you to do that are Recovery Point Objective (RPO) and Recovery Time Objective (RTO).

  • Recovery Point Objective can be thought of as how far back in time you have an available backup for the data. It also establishes how much data you are willing to lose.
  • Recovery Time Objective can be thought of as the amount of time it takes from the incident occurring to returning to normal operations as a business.

To apply RPO to an application, you would need to define how often you are taking backups of the data. For example, if you had an application that handles customer transactions, you may not be willing to lose any data and would need redundant systems to have a replica of financial transactions.

For another application, by contrast, you might be willing to take backups every few hours and could afford to lose whatever data was produced in between. So the customer transaction application would have an RPO of zero, while the other application would have an RPO of a few hours. Different applications will have different RPOs depending on how much data you need to preserve.

Applying RTO to an application means establishing what amount of time is acceptable for the application to be restored after a disaster. Going back to the example of an application that handles customer transactions, we would want it restored almost instantaneously, whereas a website FAQ section could afford a few hours of downtime. Every moment of downtime degrades customer trust and business profitability.

As it relates to Snowflake and its disaster recovery services, RTO and RPO will help you define the retention period for which you would like to store data and which services you would like to use for disaster recovery (some provide better RPOs than others). 

What is Snowflake’s Time Travel Feature?

Much like the name suggests, Snowflake’s Time Travel feature allows you to access data within a database as it was at a previous point in time. This is achieved by Snowflake taking snapshots of the data when it is modified. Various operations can be performed on the historical data including:

  • Querying data that has been updated or deleted
  • Cloning historical data
  • Restoration of tables, schemas, and databases that have been removed (accidentally or otherwise)

Data Restoration

Data restoration through Time Travel can occur within seconds, which is ideal for avoiding long periods of downtime. Time Travel can also come in clutch if someone in your organization accidentally drops a critical table, which is otherwise a very difficult scenario to fix.

With Time Travel, this can be easily addressed by un-dropping the table. Another use case: if data was corrupted by an invalid insert, the table can be fixed by reverting it to its state before the load took place.
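
To make this concrete, below is a minimal sketch of the Time Travel syntax involved; the table name, offset, and query ID are hypothetical placeholders:

  -- Query the table as it looked one hour ago (offset is in seconds)
  SELECT * FROM orders AT(OFFSET => -3600);

  -- Query the table as it looked just before a specific (bad) statement ran
  SELECT * FROM orders BEFORE(STATEMENT => '01a2b3c4-0000-1234-0000-000000000000');

  -- Restore a table that was accidentally dropped
  UNDROP TABLE orders;

  -- Clone the table as it existed before a bad load so it can be inspected or swapped in
  CREATE TABLE orders_restored CLONE orders AT(OFFSET => -3600);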

Data Retention

Time Travel comes with a retention period that can be thought of as the number of days in the past where it is still possible for data to be recovered. Although it may be useful to have a large retention period, the greater the retention period the greater the storage cost of Time Travel. Therefore it’s best to find a retention period that fits within your budget. 

The maximum retention period varies between editions of Snowflake. The Standard edition allows for a maximum retention period of one day, while Enterprise (and higher) editions increase it to up to 90 days.
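
As a rough sketch, the retention period can be adjusted with the DATA_RETENTION_TIME_IN_DAYS parameter at the account, database, schema, or table level (the object names below are hypothetical):

  -- Keep 90 days of Time Travel history for a critical database (Enterprise edition or higher)
  ALTER DATABASE sales_db SET DATA_RETENTION_TIME_IN_DAYS = 90;

  -- Keep only 1 day for a less critical table to reduce storage costs
  ALTER TABLE sales_db.public.web_logs SET DATA_RETENTION_TIME_IN_DAYS = 1;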

What is the Fail-Safe Feature in Snowflake?

Fail-Safe is similar to Time Travel in that it can be used to recover historical data, but it differs in that it is intended only for operational failures or disasters in which the data can no longer be recovered by other methods.

In addition, only the Snowflake team can conduct the recovery process through Fail-Safe storage. It’s best to think of Fail-Safe as the last option to recover your data when all else fails.

Historical data is stored under Fail-Safe for seven days and can’t be configured to another duration. It should be noted that the data will also incur storage costs. Once the seven-day period has lapsed, data can’t be recovered (even by Snowflake).

In the instance that a disaster does occur and Time Travel is not an option, it’s best to get in contact with Snowflake within the seven-day period of Fail-Safe. Once the Snowflake team is contacted, it can take anywhere from a few hours to days to recover historical data through Fail-Safe.
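
Although the recovery itself is performed by Snowflake, you can monitor how much Fail-Safe (and Time Travel) storage you are paying for. A minimal sketch against the ACCOUNT_USAGE schema:

  -- Show the tables accruing the most Fail-Safe storage (values are in bytes)
  SELECT table_catalog, table_schema, table_name,
         active_bytes, time_travel_bytes, failsafe_bytes
  FROM snowflake.account_usage.table_storage_metrics
  ORDER BY failsafe_bytes DESC
  LIMIT 10;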

A diagram covering the Time Travel and Fail-Safe stages in Snowflake

Snowflake’s Data Replication & Failover

Snowflake’s Data Replication is a process by which a database is designated as the primary database and its data and objects are periodically copied to a secondary database in a different availability zone, region, or cloud provider.

In addition to data replication, it is also possible to implement Failover to the secondary database. What this means is that if the primary database undergoes a disaster scenario such as a hardware failure, the secondary database can serve as a backup to perform database operations, avoiding any downtime.

By default, Snowflake replicates data across multiple availability zones within a geographical region and performs Failover to another availability zone in case of a disaster scenario at no additional cost.

Although Snowflake offers resilience to disaster at an availability zone level, disaster scenarios may affect the entire region and result in downtime for the database. Snowflake also offers database replication and Failover at a cross-region level, which is ideal for organizations wanting to avoid disasters at a regional scale. 

What this would mean is that your database would be replicated from one region to another. For example, a customer using Snowflake on AWS could replicate from US-East to EU-West. In this case, even if you lost access to all of your data in the US-East region, you would still be able to access your data by failing over to the secondary database in the EU-West region.
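
As a rough sketch, setting up cross-region replication and failover for a single database looks like the following (the organization, account, and database names are hypothetical):

  -- On the primary (US-East) account: allow replication and failover to the secondary account
  ALTER DATABASE sales_db ENABLE REPLICATION TO ACCOUNTS myorg.eu_west_account;
  ALTER DATABASE sales_db ENABLE FAILOVER TO ACCOUNTS myorg.eu_west_account;

  -- On the secondary (EU-West) account: create the replica and sync it
  CREATE DATABASE sales_db AS REPLICA OF myorg.us_east_account.sales_db;
  ALTER DATABASE sales_db REFRESH;

  -- During a regional outage, promote the secondary database to primary (run on the secondary account)
  ALTER DATABASE sales_db PRIMARY;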

Snowflake also provides a client redirect URL that can be configured so that requests aimed at a database in one Snowflake account are sent to a corresponding database in another Snowflake account. This is useful for redirecting requests to the secondary (backup) database after a failover.
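
A minimal sketch of setting up Client Redirect alongside failover (the connection and account names are hypothetical):

  -- On the primary account: create a connection and allow it to fail over to the secondary account
  CREATE CONNECTION prod_conn;
  ALTER CONNECTION prod_conn ENABLE FAILOVER TO ACCOUNTS myorg.eu_west_account;

  -- On the secondary account: create the replica connection
  CREATE CONNECTION prod_conn AS REPLICA OF myorg.us_east_account.prod_conn;

  -- During an outage, promote the secondary connection so the redirect URL points at it
  ALTER CONNECTION prod_conn PRIMARY;

Clients then connect through the connection URL (something like myorg-prod_conn.snowflakecomputing.com) rather than an account-specific URL, so they automatically follow whichever account currently holds the primary.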

Disaster Metrics Applied to Data Replication & Failover

As mentioned above, disaster metrics such as Recovery Time Objective (RTO) and Recovery Point Objective (RPO) can be helpful in doing a quantitative analysis of a disaster and recovery plan. Now that we understand how Data Replication and Failover work, let’s take a look at how RTO and RPO can be used to measure their effectiveness. 

In Data Replication, the frequency at which data is synced from the primary to the secondary database can be configured. What this means in the context of RPO is that any RPO level from near real-time to a few hours can be set depending on how critical the data is and the amount of data that is acceptable to be lost as a result of delays between syncs. 
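
For example, one way to pin the RPO at roughly ten minutes is to schedule the refresh of the secondary database with a task. A sketch, assuming the task runs on the account holding the secondary database (warehouse and object names are hypothetical):

  -- Refresh the secondary database every 10 minutes
  CREATE TASK refresh_sales_db_replica
    WAREHOUSE = replication_wh
    SCHEDULE = '10 MINUTE'
  AS
    ALTER DATABASE sales_db REFRESH;

  ALTER TASK refresh_sales_db_replica RESUME;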

Failover is near-instantaneous once the functionality has been configured on a primary and secondary database. In terms of RTO, this means there is little to no downtime during which business operations cannot be performed as a result of a disaster scenario. However, there may be added latency if the secondary database is in a region farther from your user base.

How Are Costs Calculated with Snowflake’s Data Replication & Failover?

If data replication and Failover are desired solutions to disaster recovery, it’s important to understand how their costs are calculated to see if they are feasible options for your business. 

Snowflake uses a pay-per-use model for its costs, and the same applies to data replication and Failover. For data replication, this means both the primary and secondary databases incur storage costs, and there are data transfer costs associated with copying data from the primary database to the secondary database.

Cost is also roughly proportional to the frequency at which data is copied from the primary database to the secondary database. A higher frequency means more up-to-date data for backup purposes but a higher cost due to more frequent data transfers. Note that cross-region Failover requires the Business Critical (or higher) edition of Snowflake.
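
To see what replication is actually costing, one option is to query the replication usage history in the ACCOUNT_USAGE schema; a sketch (the 30-day window is illustrative):

  -- Credits used and bytes transferred by replication over the last 30 days
  SELECT database_name,
         SUM(credits_used)      AS credits_used,
         SUM(bytes_transferred) AS bytes_transferred
  FROM snowflake.account_usage.replication_usage_history
  WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
  GROUP BY database_name;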

Have Snowflake Questions?

We’ve got answers, tips, advice, best practices, and much more! As an Elite Snowflake partner, phData specializes in helping enterprises of all sizes succeed with Snowflake through migrations, data engineering services, data strategy, managed services, and more. Reach out today to learn more.   

Interested in exploring how phData helped a client select and automate their disaster and recovery plan in Snowflake? Check out the case study today!

Common Questions About Snowflake’s Disaster Recovery Plan:

Which Snowflake objects can be replicated?

As of now, Snowflake only supports data replication for database objects. This includes tables, views, and stored procedures (among others). Snowflake objects that are not contained within a database, such as Users, Roles, Warehouses, Resource Monitors, and Shares, cannot be replicated to a secondary region by Snowflake. This means users are responsible for creating these objects themselves if they want equivalent functionality in the secondary region.

Which tables support Time Travel and Fail-Safe?

Databases, schemas, and tables can be recovered by Time Travel and Fail-Safe. However, when it comes to tables in Snowflake, support for Time Travel or Fail-Safe may not be offered depending on the type of table. Of the four kinds of tables (temporary, transient, permanent, and external), only permanent tables can leverage both Time Travel and Fail-Safe. Temporary and transient tables can leverage Time Travel for up to one day but cannot use Fail-Safe. External tables can use neither Time Travel nor Fail-Safe because the data is not stored in Snowflake.
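
For instance, a transient table caps Time Travel at one day and never accrues Fail-Safe storage; a small sketch with a hypothetical table:

  -- Transient table: at most 1 day of Time Travel, no Fail-Safe
  CREATE TRANSIENT TABLE staging_events (
      event_id NUMBER,
      payload  VARIANT
  )
  DATA_RETENTION_TIME_IN_DAYS = 1;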

Can Fail-Safe be disabled?

It depends on the type of table. For permanent tables, Fail-Safe is set to a seven-day period and cannot be disabled. Fail-Safe is meant to be a way to recover the data when all other methods fail. For other types of tables, such as temporary or transient tables, it is not possible to enable Fail-Safe; it is disabled by default.
