On its own, the Snowflake Data Cloud is a powerful platform for fueling data-driven decisions. When paired with Fivetran, you’re looking at a dynamic combo for quick and reliable data access.
In this blog, we’re going to explore how to ingest custom data sources into Snowflake using Fivetran.
What Is Fivetran?
Fivetran is a fully managed, web-based tool that makes it easy to build and manage data pipelines that move data from various sources into various destinations. This is done via its collection of built-in connectors, which cover a wide range of sources and targets, from databases (e.g., SQL Server) to streaming platforms (e.g., Apache Kafka) to applications (e.g., Salesforce).
Fivetran removes much of the management and development work that data engineering teams would traditionally need to do to integrate their data sources with their destinations, freeing them to focus on higher-value work for the business.
What is a Custom Data Source?
Just as there are exceptions to every rule, there will be data sources that are not yet natively supported by the Fivetran platform. These custom data sources have no connector, so their data cannot be ingested through the typical Fivetran pipeline-creation workflow.
Challenges of Ingesting Custom Data Sources
Once a business begins using a tool like Fivetran, ingesting data from a custom source that has no built-in connector can often mean building a separate, custom solution just for that data, which has some distinct disadvantages.
It would be preferable to still use Fivetran for this use case, even without a native connection to that particular source. In this blog, I will cover why this approach is worth exploring, how it can be done, and then walk through an example implementation using AWS Lambda.
Why Use Fivetran To Ingest Custom Data Sources?
As already mentioned, when you encounter a data source that Fivetran cannot handle natively, it is usually preferable to find a way to integrate that source with Fivetran rather than create a separate pipeline just for it.
Creating that separate pipeline would likely mean more time researching and building a solution, a larger maintenance footprint for the team, and potential technical debt (all the things Fivetran was meant to eliminate).
Fivetran also simplifies the implementation of a data pipeline’s non-functional requirements with its built-in alerting, security, and auditing.
Keep in mind that this will not always be the case; for example, your team may already have an agreed-upon solution or pattern for that ingestion pipeline, or the use of Fivetran may not have been prioritized by your business. In most cases, though, it will be more valuable for the business and easier for the engineering teams to use the existing pattern of Fivetran if possible.
How to Ingest Custom Data Sources using Fivetran
To this end, an integration must be created between your data source and an existing Fivetran connector. This is similar to common integration patterns in software engineering, where a separate technology or protocol bridges the gap between two applications (here, the data source that Fivetran does not support and Fivetran itself).
Several Fivetran connectors can serve as this bridge, depending on the type of data source. For instance, if your custom data source is a log collector, there is often a way to dump those logs into a queuing or streaming technology like Apache Kafka or Amazon Kinesis, which Fivetran can then consume natively.
Another example is an API for a third-party application that contains data you are interested in but has no native Fivetran connector. Here, a function-based connector can bridge the gap between querying the API and integrating with Fivetran. One of those function-based connectors is AWS Lambda; let's explore how to set that up.
Setting up a Lambda Function To Ingest a Custom Data Source
Fivetran has a very detailed tutorial for integrating Lambda into a pipeline as a connector, and I will follow a similar process here. However, I will walk you through a simplified version of that process and additionally set up the pipeline to ingest into Snowflake.
NOTE: This tutorial assumes some very basic understanding of the AWS console and Fivetran connector setup.
Below is an overview of the process and the components that will need to be involved or created:
The steps to set this up will be:
- Create a destination configuration in Fivetran (Snowflake)
- Obtain the Fivetran External ID (Group ID)
- Create the AWS policy
- Create the AWS role
- Create the Lambda function
- Complete the Fivetran Connection setup
Step 1: Create a Destination Configuration in Fivetran (Snowflake)
Log into your Fivetran dashboard and click on the Add Destination button.
Name your destination and choose Snowflake as the destination type:
Step 2: Find the Fivetran External ID
Find your Fivetran External ID (also called the Group ID). You can find it after setting up your destination by opening the destination and adding a connector:
Select AWS Lambda as the connector:
Step 3: Create AWS Policy
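The screenshots for this step are not reproduced here. In essence, you create an IAM policy that allows your Lambda function to be invoked. A sketch of what that policy can look like (the region, account ID, and function name in the Resource ARN are placeholders for your own values):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:my-fivetran-function"
    }
  ]
}
```

This policy will be attached to the role created in the next step so that Fivetran, after assuming the role, can invoke the function.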
Step 4: Create AWS Role
Create an AWS role with the trusted entity type set to AWS account. We will override this part later, so don't be too concerned about filling out this screen completely.
Name and create the role, then open the newly created role, open the Trust Relationships section, and edit the trust policy. Overwrite the policy with the following JSON:
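A sketch of the expected shape of that trust policy; the Fivetran AWS account ID and the External ID below are placeholders, not real values:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<fivetran_aws_account_id>:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "your_fivetran_externalID"
        }
      }
    }
  ]
}
```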
Notice in the JSON above that the Principal element contains the Fivetran AWS account ID, which is a static constant across all trust policies involving Fivetran. The only edit needed is to replace “your_fivetran_externalID” with the Fivetran External ID you gathered during the Lambda connector setup in step 2.
Below is an example of a completed trust policy in the AWS console:
Step 5: Create Lambda Function
Create a Lambda function; this is where you provide the code needed to retrieve data from your particular source. If you have not written any code yet, Fivetran provides sample functions that can be copied into your function for testing purposes.
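As a rough sketch of what such a function looks like, here is a minimal handler following the request/response contract described in Fivetran's function connector documentation. The table name and hard-coded rows are hypothetical stand-ins for a real API call:

```python
def lambda_handler(event, context):
    # Fivetran passes the cursor from the previous sync in event["state"],
    # and any secrets configured on the connector in event["secrets"].
    state = event.get("state", {})
    cursor = state.get("cursor", 0)

    # In a real function you would call your source API here, using
    # event["secrets"] for credentials. Hard-coded rows for illustration.
    rows = [
        {"id": cursor + 1, "name": "alpha"},
        {"id": cursor + 2, "name": "beta"},
    ]

    return {
        # New cursor so the next sync picks up where this one left off
        "state": {"cursor": cursor + len(rows)},
        # Rows to upsert, keyed by destination table name (hypothetical)
        "insert": {"my_custom_table": rows},
        # Primary keys let Fivetran de-duplicate rows across syncs
        "schema": {"my_custom_table": {"primary_key": ["id"]}},
        # False tells Fivetran this sync is complete
        "hasMore": False,
    }
```

In a real implementation, `hasMore` would be set to `True` when the source has additional pages of data, prompting Fivetran to invoke the function again with the updated state.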
The only other requirement when creating this Lambda function is to use the role created earlier as the function's execution role. This setting is located under the Permissions > “Change default execution role” expandable section in the Lambda setup screen, as shown below:
By default, a Lambda function's execution timeout is three seconds, which, depending on your expected execution time, will most likely not be long enough and will cause errors in your pipeline.
We recommend setting it to at least five minutes to create a buffer for any abnormalities in execution time; use your judgment based on the code you've written. This setting is located under the Configuration section when you open the Lambda function in the AWS console:
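If you prefer the command line, the same change can be made with the AWS CLI (the function name here is a placeholder for your own):

```shell
# Raise the function timeout to 5 minutes (300 seconds)
aws lambda update-function-configuration \
  --function-name my-fivetran-function \
  --timeout 300
```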
Step 6: Complete Fivetran Connector Setup
Back in Fivetran, complete the connector setup by supplying the ARN of the AWS role created in step 4 and the name of the Lambda function created in step 5. You can also include Secrets, a JSON object that passes secret values (such as an API key) to the Lambda function. More information on the secrets JSON object can be found here.
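For illustration, a Secrets value like the following (the key names are hypothetical) would be delivered to your function inside the invocation event, where your code can read it and authenticate against the source API:

```json
{
  "api_key": "<your_api_key>",
  "base_url": "https://api.example.com"
}
```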
NOTE: This guide allows for using the “Sync Directly” method, but there is an option to sync through an S3 bucket as well, which requires some alteration to the above setup.
That’s it! Once the connector is saved and tested successfully, you will have a working pipeline that uses AWS Lambda to ingest data from a source Fivetran does not natively support into a Snowflake database.
There are going to be times when a business using Fivetran will run across data sources that cannot be natively ingested by a Fivetran connector. When this happens, the solution is generally not going to be to abandon Fivetran and build some separate, custom pipeline.
Instead, take advantage of one of Fivetran’s existing connectors to build a bridge between your custom source and Fivetran.
One of those “bridges” is AWS Lambda, and hopefully, we have shown how easy it is to set it up to ingest from whatever data sources your coding skills allow! However, do remember that with great power comes great responsibility, and although it would be easy to use AWS Lambda to ingest all types of data sources, it might not always be the best choice.
If you find yourself having difficulty deciding what the right choice is or need help with data ingestion, Fivetran, or Snowflake in general, feel free to reach out to phData today to hear how we can accelerate your success!