October 20, 2023

How to Pull Data From On-prem Systems Using Fivetran’s HVA Connectors

By Arnab Mondal

Production databases are a data-rich environment, and Fivetran would help us to migrate data by moving data from on-prem to the supported destinations; ensuring that this data remains uncorrupted throughout enhancements and transformations is crucial. 

Fivetran’s database connectors are made to be safe in this aspect so that you don’t have to worry about securing your Production data continuously. As a safety feature, source and target structure alignment is always checked for compatibility before connecting. Some of the databases supported by Fivetran are:

In this blog, we will review how to pull Data from on-premise Systems using Fivetran to a specific target or destination.

Database Integration Overview

Let us briefly overview the necessary steps to pull data from on-prem systems using Fivetran.

  1. Connecting Your Database: To begin usage, connect Fivetran to the production database or use read-replica via a Fivetran user created in your database.

  2. Initial Sync: Fivetran will collect data from all selected tables and compile those. Remember that Fivetran can only do this activity for tables and data it has access to. The time taken to complete an initial sync varies by user, but generally, it takes less than a day.

  3. Transformation & Mapping: Fivetran goes through a considerable transformation and typecasting process where it collects all the data from every column in the source and maps it to a respective column in the target. Firvetran also transforms any unsupported datatypes into possible supported datatypes in the destination. Note that aggregations are not covered at this stage.

  4. Loading Schemas and Tables: Once Fivetran completes building the columns and tables to match the source structure, the source data is mapped and populated to the respective targets during the initial sync.

  5. Incremental Updates: With Fivetran, you can choose how often you load updated or changed data. Fivetran generally loads the updated data in batches. These batches run on time frequencies selected by the user, ranging from 5 minutes to 24 hours. Fivetran operates to identify only the new changes and updates in your schema and then maps them to the target before collecting and loading the data into them using a merge operation.

We will now go over all the topics one by one.

Connecting Your Database

Connection Options

Fivetran connectors can be categorized as pull connectors. Hence, Fivetran must have a way to connect or establish access to your source database. Some of the connection strategies supported by Fivetran are as follows:

  • Connect directly to the database port on your database host

  • Connect via an SSH tunnel

  • Connect via a reverse SSH tunnel

Safety Features of Fivetran

Fivetran works to integrate features and processes to ensure the safety of your data source. You can connect Fivetran with a read replica of your database if you do not want to connect your production workloads with Fivetran. Fivetran queries are optimized to be efficient and not clash with other queries, and hence, it is also safe to connect directly with the production database. 

The Fivetran queries generate a simple execution plan that will take a lower priority and run concurrently with any other query you may have. Fivetran uses multiple threads to make the syncs faster due to parallel processing. This means that they can execute multiple queries at the same time.

High-Volume Agent Connectors

The Fivetran High-Volume Agent connector can facilitate a high-volume data replication across multiple databases, and currently, there exist HVA connectors for the following databases (subject to change): 

  • Db2 for i (Beta)

  • Oracle

  • SAP ERP

  • SQL Server (Beta)

HVA or High Volume Agent. HVA connectors use on-premise software or an agent to connect and perform high-volume data replication tasks. The on-premise agent is responsible for sending data to Fivetran, which is then processed and loaded into the destination. 

There are multiple Network Configuration options to connect the databases and services using HVA. You can find more information about them in their official documentation

On Linux and Unix, HVA runs as a daemon or a system, whereas Windows runs as a Windows Service. Fivetran generally connects with HVA using a specific TCP/IP port, which is by default 4343.

Various Database Connection Methods

You can connect your Source Database with the Fivetran Saas in multiple ways, and we will briefly review them.

A screenshot showing an example of a direct connection.
Direct Connection
A screenshot showing an example of a private link connection.
Private Link Connection
A screenshot showing an example of an SSH tunnel connection
SSH Tunnel Connection
A screenshot showing an example of a reverse SSH tunnel connection.
Reverse SSH Tunnel Connection
A screenshot showing an example of a VPN Tunnel Connection.
VPN Tunnel Connection
A screenshot showing an example of a proxy agent connection.
Proxy Agent Connection

Advantages of HVA

HVA connectors are best suited for large amounts of data or databases with an extensive historical data repository. The most common example of such databases is where events are tracked. For software products or ERP backend databases, thousands of data units must be tracked and monitored. The distributed architecture of HVA provides the agents with performance, security, and scalability advantages.

  • Speed: The agent on the source database will filter the data before sending it through the data pipeline. Databases often write more information to a transaction log than is required. That’s why filtering close to the source improves the efficiency and increases the speed but only copying data, which is required.

  • Bandwidth: The HVA agent will compress the data before sending it, requiring less bandwidth and reducing the network cost. HVA also allows the capture of changes directly from various DBMS articles.

  • Security: The traffic between HVA and Fivetran is secure and encrypted during transit. It is locked down, and authentication is secured via an SSL certificate.

Initial Sync

Fivetran will connect with your source database during your initial sync and copy the contents of all the tables selected for the sync. The initial sync is performed in two different ways depending on whether your source database has:

  • B-Tree like MySQL or SQL Server:  Fivetran does the initial sync using a series of queries that look like the following code:

				
					select *
from {table}
where {primary key} > ?
order by {primary key}
limit {page size}
				
			

This strategy is efficient for B-trees, which are already ordered by the primary key, and hence, you do not need a sort-by clause in this query.

  • Heap like Oracle, PostgreSQL: Fivetran does the initial sync more complicatedly, where the small to medium tables are synced with a single SELECT query. For larger tables, the data is broken down into more manageable blocks and in order of the data being added to the database. Then, the different blocks are synced in chronological order.

Excluding Source Data

If you want to exclude some sources from your database and do not want to sync them, you could exclude the tables, schema, or columns on the Fivetran dashboard. You could also change the permissions on the Fivetran user in the source database, which would restrict access to those schemas and tables.

Check Initial Sync Progress

You should monitor the progress of the initial sync using the “Log” tab on the Fivetran dashboard. There you can: 

  • Check the recent log records so that the progress of the table imports can be tracked.

    • “Not Started” means the tables have not been imported yet.

    • “Completed” means that those tables have been imported.

  • Check the previous log records to track the sync progress.

Transformation and Mapping Overview

Like any other Cloud API connection, Fivetran syncs as many source schema and tables as possible with the help of the setup and configuration given. Fivetran database connectors do one-to-one mapping of the source schema and tables to the destination schema and tables. 

Columns are also mapped similarly unless the source datatype is not supported in the destination. Then, it is transformed into a data type that is supported and most similar to the one in the source database.

Loading Schemas and Tables

Fivetran allows one database connector to replicate multiple schemas, including the underlying tables. They are stored in the same logical schema or table hierarchy as it is stored in your source database. You can add a prefix of your choice to each of the schemas in the destination of the data.

Exceptions Where the Name Needs to be Changed

Fivetran tries to minimize database name changes as much as possible, but there are some exceptions to this rule where Fivetran enforces a name change. Those are done under the following scenarios: 

  • Non-ASCII characters, like Chinese characters, fall under the exception in Fivetran and are replaced by transliteration.

  • Any other characters other than numbers, letters, and underscores are replaced with underscores. 

  • All Upper-case characters are converted to lower-case.

  • All Numbers at the beginning of a name get replaced with an underscore 

  • The character limit is 127 for Fivetran, so any name that has more than 127 characters will get truncated 

Fivetran does not support changing the connector name; hence, changing the name of the destination schemas is impossible. To change the name of the destination schema, you need to set up a new connector and remove the old one when the new one has finished syncing. 

You could also use an SQL view on top of the existing schema with some new name, which would work better for your PowerBI or the SQL layer.

Incremental Updates

Fivetran can perform incremental updates of the new or any modified data from the source database to the destination database you migrated. It utilizes the native change capture mechanism to request only the changed data since the last sync, including deletes. Databases could use multiple Change capture mechanisms, like 

  • DynamoDB Streams, DynamoDB

  • Change Data Capture, MariaDB

  • Change Streams, DocumentDB

  • Oplogs, MongoDB

  • LogMiner, Oracle

  • Binary Logs, MySQL

  • Change tracking of change data capture, SQL Server

  • Shadow tables and trigger SAP ERP on HANA

  • Logical Replication or XMIN system column, PostgreSQL

  • Fivetran Teleport Sync

What is a Progress Cursor?

Fivetran always maintains an internal set of progress cursors, which will help you track the progress and pinpoint the exact place where the last successful sync was left off. You could also record the last sync for the row in a column like fivetran_synced with a UTC timestamp. This will make sure that you do not lose any data between hand-off between syncs.

The progress cursor is the main reason Fivetran’s system is tolerant and unaffected by service interruptions. If your destination database goes offline or any connection issue happens, the syncing will resume exactly where it left off as long as the log data still exists.

Deleted Data

All the data deleted will also be synced, and the tables with a primary key, the row, will be marked as deleted but not removed from the database. This is what is known as a soft delete. 

An extra column with the name _fivetran_deleted (Boolean) would be used to keep track of deleted records. (True value means deleted) . This is true for all database connectors except the following six databases, where you can click on each link to know more : 

  1. MariaDB

  2. Oracle

  3. Snowflake

  4. MySQL

  5. PostgreSQL

  6. SQL Server

Database Sync Frequency Logic

By default, the database connectors sync the modified and new data every 15 minutes. Every table’s size might differ from the data size, leading to a job running for more than 15 minutes. In that case, if your job begins at 4 p.m. and goes on till 4:16 p.m. The next database sync will happen at 4:30 p.m. 

The update interval can be updated in your Fivetran dashboard. If the Database Sync goes through multiple or consistent errors, Fivetran will retry the sync after the specified update interval or one hour, whichever is shorter.

Extra Points

Data Warehouses as Source

Currently, it is in Beta, but you can use BigQuery and Snowflake as data sources in Fivetran. This allows possibilities for Reverse ETL and opens up ways to efficiently utilize your data warehouse and prospects.

Conclusion

Congratulations on your successful migration from on-premise systems using Fivetran. Reach out to our team of Fivetran experts if you need any assistance with your migration.

FAQs

Fivetran uses an on-premise agent or HVA to pull data from on-premise and send the data to Fivetran.

Fivetran is an ELT tool that is cheaper and easier to implement than traditional ETL tools or pipelines.

Data Coach is our premium analytics training program with one-on-one coaching from renowned experts.

Accelerate and automate your data projects with the phData Toolkit