While there are numerous data integration tools available in the market for data ingestion into Snowflake, including many third-party solutions found in the Snowflake Marketplace, Snowflake’s acquisition of Datavolo and the subsequent creation of Snowflake Openflow offer a native solution.
Powered by Apache NiFi, Openflow is capable of loading different types of data (structured, semi-structured, and unstructured) from various sources into Snowflake using both standard and custom connectors.
In this blog, we will discuss:
What is Openflow?
How to ingest data from PostgreSQL to Snowflake with Openflow.
Key features available for the Openflow PostgreSQL connector and native components.
Important observations that caught our attention.
What is Openflow?
Openflow is a data integration platform created by Snowflake. It was built on technology acquired from Datavolo and is powered by Apache NiFi, a widely used data integration tool across many organizations.
Openflow automates and manages data flows between various enterprise on-premises, cloud, and hybrid data sources, enhancing Snowflake’s offerings at the bronze/landing layer for data engineering workloads. It also provides capabilities for ETL (Extract, Transform, Load) and Reverse ETL processes.
What we like most about Openflow is that it simplifies data ingestion from multiple sources and accelerates Snowflake customers’ success by eliminating the need for third-party ingestion tools, enabling quick prototyping, and supporting reusable data pipelines.
How to Ingest data from PostgreSQL to Snowflake with Openflow
Features
Snowflake has developed a reusable connector in Openflow for ingesting data from a PostgreSQL database, equivalent to an always-on pipeline for real-time data ingestion. Below are some of the key features of the Openflow PostgreSQL connector:
Can be used for bulk data loading and CDC replication. Bulk data loading (also known as an initial data load, milestone data load, etc.) is referred to as Snapshot in the connector, and CDC replication is referred to as Incremental.
Uses Snowpipe Streaming to load data into Snowflake.
Works on a list of tables as well as table name patterns.
Can track DDL changes (ALTER) to existing tables and replicate the change downstream along with the data.
Persists the changed data (INSERT, UPDATE, DELETE, TRUNCATE) in separate “Journal” tables along with reflecting the changes in the main table.
Prerequisites
Below is a non-exhaustive list of the most important prerequisites required to start using the connector:
For an on-premises PostgreSQL database, set the wal_level to logical. For AWS RDS (which we will be using), change the rds.logical_replication parameter to 1. To verify that the change has taken effect, we can run a command such as the following:
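SHOW wal_level;
-- Should return "logical" once the setting is in effect (on AWS RDS, typically after the
-- parameter group change has been applied and the instance has been rebooted).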
Create a publication in PostgreSQL before replication starts, as the connector requires it. Publications can be created for all tables, for a subset of tables, and/or with a subset of DML operations (INSERT, UPDATE, DELETE, TRUNCATE) to capture. Publications can also be modified later while the connector is running (see the example after the DDL below). Below is a sample DDL to create a publication:
CREATE PUBLICATION bulk_test_publication
FOR TABLE public.bulk_test1, public.bulk_test2, public.bulk_test3;
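For example, if more tables need to be replicated later, the publication can be altered in place and then checked against the catalog (the added table name here is illustrative):

ALTER PUBLICATION bulk_test_publication ADD TABLE public.bulk_test4;

-- Confirm which tables the publication currently covers
SELECT * FROM pg_publication_tables WHERE pubname = 'bulk_test_publication';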
Set up the Snowflake Database, Schema, Role, and Warehouse to use as the target.
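A minimal sketch of what this target-side setup might look like in Snowflake SQL; all object names below are illustrative, not taken from the connector documentation:

-- Illustrative names only; adjust to match your environment and security model
CREATE ROLE IF NOT EXISTS openflow_role;
CREATE WAREHOUSE IF NOT EXISTS openflow_wh
  WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
CREATE DATABASE IF NOT EXISTS pg_replica_db;
CREATE SCHEMA IF NOT EXISTS pg_replica_db.public;

GRANT USAGE ON WAREHOUSE openflow_wh TO ROLE openflow_role;
GRANT USAGE ON DATABASE pg_replica_db TO ROLE openflow_role;
GRANT ALL ON SCHEMA pg_replica_db.public TO ROLE openflow_role;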
Setup and Usage
Openflow provides the connector definition as a JSON file. Below are some of the high-level steps involved in setting up the PostgreSQL connector for ingesting data into Snowflake using Openflow:
Import the connector definition file as a Process Group. This will be the starting point of the flow.
Set up the values for the parameters in the Parameter Context for this process group. These parameters define the connections to PostgreSQL and Snowflake and specify the list of tables we want to ingest.
Upload the PostgreSQL driver using Manage Assets under the Source Database Parameters in Parameter Context for this process group.
Right-click on the process group and click Start to begin ingesting the data from PostgreSQL into Snowflake.
Below is a video that shows you how to set up the connector as a flow and ingest the data into Snowflake.
How to use the Native Components to Ingest Data From PostgreSQL into Snowflake?
You might be wondering why you’d want to use the native components from NiFi.
There are a myriad of reasons why you’d use the Native Components, but the best reason is to add additional functionality to the existing workflows, ensuring that they meet any specific requirements you may have.
Features
We can also use the Native Components to do the data ingestion into Snowflake. Below are the key features for this:
Works on a list of tables as well as table name patterns.
Processors are available to run parameterized SQL statements using flow file attributes.
Multiple tables can be loaded iteratively in parallel.
Watermark column-based incremental loading is supported without explicitly maintaining an audit table to store the latest value of the watermark column.
Flow
Below is an example of a flow inside Openflow to process data from PostgreSQL to Snowflake:
Setup and Usage
Openflow also includes the native NiFi processors, which we can use for this use case. Below are some of the high-level steps involved in building a flow with the native components for ingesting data into Snowflake using Openflow:
Create a Parameter Context similar to the one used in the connector with all the connectivity parameters required for PostgreSQL and Snowflake.
Upload the PostgreSQL driver using Manage Assets under the Source Database Parameters in Parameter Context for this process group. If this has already been done for the Connector, we can reuse it, and this step is not required.
Create a Process Group with the Parameter Context created in the above step.
Add Components to get the list of tables required for ingestion.
Add Components to build the ingestion queries to be run on the source database based on the table name, watermark column, and batch size (see the sample query after this list).
Add Components to run the ingestion queries to load the data as flow files.
Add Components to load the data into Snowflake as per the batch size.
Right-click on the process group and click Start to begin ingesting the data from PostgreSQL into Snowflake.
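For reference, the ingestion query built for each table in the query-building step above typically reduces to a watermark-filtered, batched SELECT. A sketch, assuming a hypothetical table public.orders with a last_updated watermark column and a 100,000-row batch size:

SELECT *
FROM public.orders
WHERE last_updated > '2025-01-01 10:15:00'  -- latest watermark value tracked by the flow
ORDER BY last_updated
LIMIT 100000;                               -- batch size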
Below is a video to show how to build a flow inside Openflow utilizing the native components to ingest data:
Things to Remember
Below are some of the most important things to remember for using Openflow to ingest data from PostgreSQL into Snowflake:
While Using the Openflow Connector
Keep the WAL retained at the source to a minimum. As a best practice, we should keep the flow enabled in lower environments to prevent the build-up of WAL when using AWS RDS, as that ensures constant cleanup of the WAL generated by the RDS heartbeats. This is because the replication slot created for logical replication retains changes for all the tables in the database, irrespective of the ones mentioned while creating the publication. A query for monitoring the WAL retained by the slot is shown at the end of this section.
This blog provides more insights into this process and the workarounds required for this.
In higher environments, there is minimal chance of this happening, as the flow should be enabled by default to capture the constant changes in data. It is recommended that we do not alter the journal tables or the data in them. The connector uses them to update the destination table as part of the replication process.
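One way to monitor how much WAL the replication slot is holding on to, using standard PostgreSQL catalog views (slot names will vary by setup):

SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;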
While Using the Native Openflow Components
For large volumes of data, use a batch size of at least 100,000 in the GenerateTableFetch and PutDatabaseRecord processors. This ensures that data is loaded through the creation of temporary stages and the COPY command, which is faster than JDBC-based batch INSERTs.
The source system should maintain consistent values in the watermark columns for the incremental process to work.
Closing
Openflow’s capabilities extend far beyond this. It is continually expanding its array of connectors to integrate with an extensive range of structured and unstructured data sources, including SharePoint, Google Drive, Slack, Google Ads, Meta Ads, and many others.
This combination of Snowflake and Openflow is transforming data management by offering a unified platform that supports various data types and modes, whether batch or streaming, from virtually any source and at any scale.
If you’re eager to explore Openflow’s seamless data integration capabilities, contact the experts at phData today!
Need additional help?
As Snowflake’s 6X Partner of the Year, phData is more than capable of helping your business succeed with the Snowflake platform. Contact us today for help with Snowflake migrations, strategy, AI use cases, and much more!
FAQs
What is Apache NiFi used for?
Apache NiFi is a powerful and reliable system designed to automate and manage the flow of data between different systems. It serves as a data integration tool that enables users to extract, transform, and load data from various sources. NiFi provides features like data provenance tracking, dynamic prioritization, secure communication, loss tolerance, and guaranteed delivery. These capabilities ensure low latency, high throughput, and a seamless experience for users in terms of design, control, feedback, and monitoring.
What does Openflow do?
Openflow offers comprehensive solutions for enterprise organizations to gather a diverse range of multi-modal data, whether it’s structured or unstructured. This enables these organizations to effectively leverage this data for their AI initiatives or various other data-driven projects. With Openflow’s capabilities, businesses can ensure they have the necessary data to fuel innovation and make informed decisions. Snowflake created Openflow based on technology acquired via Datavolo in November 2024.