The most crucial step you need to perform before you begin to use the Snowflake Data Cloud as your data warehousing solution is to get your data loaded into Snowflake.
Data integration is essentially the Extract and Load portion of the Extract, Load, and Transform (ELT) process.
Data ingestion involves connecting your data sources, including databases, flat files, streaming data, etc, to your data warehouse. This crucial step needs to be flexible and reliable to keep your data up to date.
Snowflake provides native ways for data ingestion.
However, in this post, we’ll take you through why you should consider using a third-party data ingestion tool for Snowflake, what we consider to be the best tools in the industry, and some tips for using those tools.
Why Leverage a 3rd-Party Tool for Data Ingestion in Snowflake?
Data ingestion can be a cumbersome process, especially if your organization has many different data sources to load into Snowflake. Using native data ingestion tools from Snowflake can work well, but often requires extensive programming knowledge to develop reliable and reusable ingestion processes.
Third-party ingestion tools are designed for ease of use and often offer low code/no code solutions.
Some tools can be used to connect to sources that native Snowflake tools cannot (without the help of API calls), such as Salesforce. Other features include email notifications (to let you know if a job failed or is running long), job scheduling, orchestration to ensure your data gets to Snowflake when you want it, and of course, full automation of your complete data ingestion process.
There are many different third-party tools that work with Snowflake. To help you make your choice, here are the ones we consider to be the best.
What Are the Best Third-Party Data Ingestion Tools for Snowflake?
Fivetran is a tool dedicated to replicating applications, databases, events, and files into a high-performance data warehouse, such as Snowflake. The Fivetran team boasts that setup can take as little as five minutes to complete, allowing your team to move to Snowflake quickly and efficiently.
Additionally, the 300+ standardized connectors allow pipelines in Fivetran to be fully managed and require zero maintenance on the user’s part.
Fivetran is the answer for anyone looking to focus their efforts on analytics and not pipeline management. Snowflake also has a long-standing partnership with Fivetran, which allows for innovations on Fivetran’s part to keep up with new features in Snowflake.
Tips When Considering Fivetran:
- Although Fivetran can perform some basic transformations, it is best suited for pure replication. If more advanced transformations are required, it’s best to pair it with a transformation tool such as Data Build Tool (dbt).
- Fivetran works with all three Snowflake cloud providers.
- If using a network policy with Snowflake, be sure to add Fivetran’s IP address list, which will ensure
Azure Data Factory (ADF)
Azure Data Factory is a fully managed, serverless data integration service built by Microsoft. It includes an intuitive visual environment where users can create data ingestion pipelines with little to no code. ADF includes over 90 built-in connectors at no extra cost.
This is a great solution for those wanting to perform data ingestion into Snowflake who may already have other services with Azure. ADF includes many connectors that are not native to Snowflake. Not only can ADF handle data ingestion, but it can also be leveraged for transformations and the orchestration of pipelines.
Tips When Considering ADF:
- ADF will only write to Snowflake accounts that are based in Azure.
- ADF requires an Azure Blob Storage container with a Shared Access Signature configured as a staging area before copying into Snowflake.
- Source data formats can only be Parquer, JSON, or Delimited Text (CSV, TSV, etc.).
Streamsets Data Collector
StreamSets Data Collector Engine is an easy-to-use data pipeline engine for streaming, CDC, and batch ingestion from any source to any destination. Data Collector can run on-premises or in the cloud, wherever your data is.
Data Collector pipelines are designed to be re-used across different sources and destinations by changing just a few configuration settings. It also includes an alerting system to indicate when any data drift or outages occur within the pipeline.
Data Collector is a good solution for those looking for easy-to-use and easy-to-replicate pipelines that include a multitude of source connectors to move data to Snowflake. Data Collector also offers replication and Change Data Capture (CDC) to be able to accurately and efficiently get your data into Snowflake.
Tips When Considering Streamsets Data Collector:
- As a Snowflake partner, Streamsets includes very intricate documentation on using Data Collector with Snowflake, including this book you can read here.
- Data Collector can use Snowflake’s native Snowpipe in its pipelines. But, be careful the ‘Table Auto Create’ must be deselected — if it isn’t, then an exception is thrown at runtime or during pipeline validation.
Qlik Replicate is a data integration tool that supports a wide range of source and target endpoints with configuration and automation capabilities that can give your organization easy, high-performance access to the latest and most accurate data. Replicate can interact with a wide variety of databases, data warehouses, and data lakes (on-premise or based in the cloud).
It can also connect to many streaming services, ingest flat files, and has out-of-the-box connections to enterprise applications like SAP and Salesforce.
Replicate automatically generates target schemas in Snowflake and simplifies massive ingestion from thousands of sources with an intuitive GUI, eliminating the need for manual coding. This can provide the efficiency needed to spend more time developing analytic solutions instead of moving data.
Tips When Considering Qlik Replicate:
- When Replicate creates a new table in the target endpoint, it defines only one index on the table. The index will either be the Primary Key or the first Unique Key (according to alphabetical order) of the table. No other indexes will be defined in the target. If additional indexes are required, these will need to be defined manually.
- When replicating tables without a Primary Key, there is no way to verify whether a record already exists on the target. This may result in data inconsistency when UPDATE and DELETE operations are performed on the target database.
- Replication of calculated values is not supported during Change Processing.
Matillion is a complete ETL tool that integrates with an extensive list of pre-built data source connectors, loads data into cloud data environments such as Snowflake, and then performs transformations to make data consumable by analytics tools such as Tableau and PowerBI. Matllion can replicate data from sources such as APIs, applications, relational databases, files, and NoSQL databases.
What sets Matillion apart is that it can be deployed in the cloud (GCP, AWS, or Azure) or on-premises which can be helpful for users working in highly regulated industries.
For simple and quick replication to Snowflake, Matillion offers Data Loader, a SaaS tool that migrates data from various data sources. It’s a code-free wizard that provides basic administration capabilities like controlling access to pipelines, connections to data source endpoint, and cloud data warehouse. It also includes a metric dashboard to display information about data loads, and statistics around records uploaded.
Tips When Considering Matillion:
- Matillion has a limited number of native connectors, however, allows users to create their own connector to any REST API source system.
- Matillion is not a no-code solution, but rather a low-code solution. This allows users to utilize Python to customize transformations.
AWS Data Migration Service (DMS)
AWS DMS is a cloud-based service provided by Amazon that is specifically designed to migrate relational databases, NoSQL databases, data warehouses, and other types of data stores to the cloud.
DMS takes over many of the difficult and tedious tasks of migration including performing capacity analysis, the ability to scale up or down migration resources, automatic failover, conversion of a source schema, and much more.
Although DMS does not provide Snowflake as a direct target, DMS can migrate your data store to an AWS S3 bucket which can then be ingested into Snowflake with the help of AWS Glue and AWS Lambda.
For those looking to migrate to Snowflake who prefer using AWS services, DMS is a great solution. The advantage of being able to convert schemas from source to target automatically is a game changer to the time and cost of performing a migration to Snowflake.
Tips When Considering DMS:
- DMS requires an EC2 server to run, if you’re experiencing slower performance than expected, scale up the CPU in the EC2 instance.
- DMS defaults to replicating eight tables in parallel. This can be changed if using a smaller EC2 instance or increased if using a larger instance, but increasing it too much may negatively affect performance
- The full parallel load can be used to increase performance by partitioning tables and loading them in parallel. However, this can only be used on a few specific sources: (Microsoft SQL Server, Oracle, DB2, and MySQL)
There are many third-party data ingestion tools out there to choose from, and it can be daunting to go through them all. Depending on your company’s situation, one of the tools we mention in this article will get your data replicated into Snowflake quickly and efficiently, so you can focus on using that data to make business decisions.
If using a third-party application is not for you, be sure to check out our other blog, Most Popular Native Ingestion Methods for Snowflake.
If you need help with migrating to Snowflake, phData is happy to help! As the 2022 Snowflake Partner of the Year, phData thrives at helping enterprises of all sizes and industries swiftly move to Snowflake with minimal business disruptions. Discover how phData can accelerate your Snowflake journey today by exploring our expert Snowflake consulting services.
Truthfully, the hardest hurdle is usually the first one, which is getting the tool connected to your sources and Snowflake. Since every company has a slightly different setup (each with firewalls and network policies), sometimes this can be challenging to navigate while performing setup. Thankfully, the phData team has written extensive articles on how to connect tools to Snowflake, such as How to Write Data to Snowflake from Azure Data Factory.
The biggest reason is the ease of use. Snowpipe and other native methods aren’t as user-friendly, especially for companies that may not have experienced data engineers on staff. Most of these tools include a robust GUI which makes it much easier to create pipelines rather than writing the code from scratch. By the same token, these tools are created solely for data ingestion and will provide features such as validation and alerting that would be difficult to code from scratch.