A common problem solved by phData is the migration from an existing data platform to the Snowflake Data Cloud, in the best possible manner.
With numerous approaches and patterns to consider, items and processes to document, target states to plan and architect, all while keeping your current day-to-day processes and business decisions operating smoothly—we understand that migrating an entire data platform is no small task.
We have experience migrating many different types of systems to Snowflake. We plan our Snowflake migrations to account for your existing operational activities, so your business can go on as usual.
This post will cover our planning strategies and tips for a smooth Snowflake migration. Since each migration is unique and presents its own set of challenges, the following only illustrates our foundational framework for migration.
For a more detailed view of our migration process, take a look at one of our migration guides.
Take an Inventory
Taking an inventory is an important step for the following reasons;
- It informs the scope of a Snowflake migration.
- It helps identify the required objects.
- It’s useful in describing the activity and size of the data.
Understanding what is in scope to be migrated, whether lift and shift or rearchitecting, is a critical step. Below are a few of the items that need to be taken into account.
The sources involved could influence or determine the options available for the data ingestion tool(s). These could include other databases, data lakes, SaaS applications (e.g. Salesforce), Access databases, SharePoint, or Excel spreadsheets.
Not only do we need to know which sources are in scope, but we also need to identify the owners of said sources. Those owners will have a deeper level of knowledge of their systems, likely facilitate access to the data objects, and be a key partner in extracting data.
An efficient way to think about destinations is by identifying dependencies. This is an area that is easily overlooked but is essential to keeping a business functioning without interruption.
Data flows from the current data platform to the destination. The said destination could be a reverse ETL pattern for an operational system, a data lake for machine learning or data science, or an extract process to copy data to an Access database for end users—the goal is to identify if these destinations will need support from the Snowflake platform.
Similar to sources, identifying the owners is an important component. The destination owners will help to ensure:
- Data is in the correct format and structure for the business consumers.
- The necessary access is granted so data flows without issue.
- Any additional requirements and service level agreements are satisfied.
These include tables, schemas, databases, stored procedures, and jobs (e.g. SQL Server Agent jobs). Oftentimes inventorizing database objects will uncover schemas, tables, etc., that aren’t in use anymore or are being used for performance reasons due to current system constraints. This process can help reduce the number of objects needed in the scope of the new Snowflake environment.
Another component is gathering statistics from the database, such as activities, table sizes, and the volume of changes in a given day. This helps us size the Snowflake warehouses, so performance is tuned correctly.
Ingestion Pipelines and Jobs
This particular component of the inventory can be used to create a mapping of database objects and help determine how they are loaded. It also enables us to understand where and how/if data is combined during the ingestion pipelines.
Similar to the database objects, we gather information about the volume of data being processed, the frequency of the pipelines, and the types of activities performed (e.g. incremental updates) to better size our Snowflake warehouses. Depending on the tool(s) used, some of these specific numbers won’t be readily available, but that’s okay (an estimate or approximation still helps with the sizing).
Transformations can be a part of data ingestion (ETL pattern) or can take place at a later stage after data has been landed (ELT pattern). Either way, it’s important to understand what data is transformed, and how so.
More often than not, the SQL code used to perform the transformation won’t be able to run as-is from the current system to Snowflake. This is where the phData Toolkit will come in handy. Our SQL Translation (previously SQLMorph) tool can convert a variety of SQL dialects into Snowflake-specific SQL.
Determine the Approach
Once we have an understanding of the current state and an idea of the scope of the migration, we plan for the approach that will lead us to the target state. It usually involves asking questions about whether the migration will be a lift and shift – keeping the target state as close as possible to the source state, but on Snowflake – or if the ingestion and transformation patterns should be rearchitected to remove complexity and technical debt.
Either approach is valid, but they have different steps and strategies, so it’s important to plan and choose the appropriate approach before the actual Snowflake migration begins.
Below are some considerations for the aforementioned approaches:
Lift and Shift vs. Rearchitecting/Redesigning
A lift and shift approach includes largely leaving the pipelines and jobs as-is while redirecting the work and output to Snowflake. Because of that, this approach has the benefits of both getting to Snowflake in a quicker manner and allowing for a true “apples-to-apples” comparison.
The rearchitecting approach attempts to remove or reduce complexities in the pipelines, thereby optimizing for processes on Snowflake, and even using an alternate data model to further unlock the data’s potential. This approach is a longer-term view of the platform as it allows planning for the incorporation of future use cases and how to optimally use Snowflake.
When it comes time to decide which approach is right for you, we have successfully performed both types of migrations and have identified some things to consider.
The lift and shift approach is recommended for migrations with a hard date deadline (e.g. end-of-life hardware) or with requirements that data must match (e.g. have the exact same transformation logic applied) between systems due to compliance reasons. In this case, we recommend checking out SQL Translation in our phData toolkit. It allows for easy conversion of SQL between dialects without changing any logic contained within.
If the lift and shift considerations don’t apply, then we recommend the rearchitecting approach. This allows you to potentially build out a new data model, reduce any complexity from the existing architecture, and plan for the long-term future use cases.
As we just mentioned in the lift and shift approach, the pipelines are mostly kept intact and redirected to Snowflake. Depending on what is being performed in the pipeline and how the tool handles the processing, some reworking may be needed to change the source SQL dialect to Snowflake-specific SQL.
On the other hand, rearchitecting the pipelines may involve decoupling the ingestion and transformation processes and simplifying the various workflows.
Our recommendation on ingestion pipelines is to follow an ELT approach, where raw data is landed into Snowflake to ensure that it closely matches the source layer – with data type conversions being the only difference.
Our partners at Fivetran have an excellent understanding of this approach. Between their HVR technology that provides robust, near-real-time data ingestion and their SaaS offering with a wide variety of easy-to-use connectors, they have really streamlined the data ingestion process so you can focus on the value-add for your business.
This stage is where transformations start in Snowflake.
Part of the inventory exercise is to understand where transformations are executed. Here we determine how these actions should be migrated, whether by decoupling the transformations from the data ingestion (ETL vs. ELT) or redirecting an existing job to Snowflake (e.g. dbt).
Many of the successful migrations we have been part of have involved our partners at dbt or Matillion (and sometimes both). dbt is a code-first, SQL-based technology that focuses on transforming your data and allows you to incorporate software engineering best practices into your analytics.
Matillion provides a cloud-native, GUI-first approach to building out transformation pipelines. Both technologies provide excellent ways to create transformation pipelines, ensuring your business is getting trustworthy and actionable data.
When it comes to the approach for reporting, there are a few points to keep in mind. Part of the decision on the approach will be influenced by the technology involved. Some tools provide more flexibility in changing connections in reports and thus, allow for a lift and shift approach.
Another point to consider is the end data model and if it will differ from the current structure. A different end model will necessitate the rebuilding of existing reports to leverage the new design.
Let’s talk about the different models and approaches used to structure data in the new Snowflake platform.
Choosing the warehousing architecture goes beyond the migration; it supports and enables the business. These approaches have both benefits and drawbacks and the decision is usually made based on the business requirements and the value that each approach may provide.
Below are the data models that can be used:
- Dimensional modeling with facts/dimensions (e.g. Kimball, Inmon)
- Data vault methodology
Create the Information Architecture
The Information Architecture document provides pertinent details for organizing, structuring, labeling, securing, and sharing the Snowflake environment. Based on our experience, being explicit with the architecture design ensures a successful migration to the target state.
The Information Architecture is intended to be a living document. It’s revisited periodically to account for the change in requirements, the addition of different sources to the Snowflake Data Warehouse, and as business needs evolve. This helps maintain a clear view of the design and intent of the platform.
Ready to Get Started?
Getting started and establishing a solid understanding of the effort and scope is oftentimes the toughest part. A data strategy is an excellent place to start and can help to accelerate your migration.
Great question - there are a lot of moving parts in a migration. That’s why we’ve developed our phData toolkit to help automate certain tasks and activities. Check it out and contact us to chat about your migration!