This post was co-written by Arnab Mondal and Ayush Kumar Singh
Fivetran’s Local Data Processing (LDP), previously known as HVR (High Volume Replicator), is a data replication tool that helps businesses move data from one data source to another. LDP is a comprehensive tool that can replicate data from a variety of sources, including databases, files, and applications.
Fivetran LDP is compatible with popular operating systems like:
AIX_6.1-POWERPC-64BIT (AIX: 6.1, 7.1, 7.2)
Linux (x86-64 bit) based on GLIBC 2.12 and higher
Solaris for SPARC: 10, 11.x
Windows (PC): 8, 10. Windows Server: 2012 R2, 2016, 2019
In this blog, we will take a deep dive into the LDP architecture.
LDP uses a distributed architecture for data replication: the hub and the (optional) agents are deployed on separate machines. This allows LDP to scale so that it can handle large data sets and high volumes of replication traffic.
LDP captures (reads) transaction logs from source locations in real time. The captured data is compressed, optionally encrypted, and then sent to a central hub machine. The hub system then integrates (applies) the data into the target location(s).
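The capture, compress, optionally encrypt, and integrate flow described above can be sketched as a minimal pipeline. This is a toy model of the data path, not LDP's implementation: the function names are our own, and real encryption is left as a pluggable step.

```python
import zlib


def capture(change_records: list[str]) -> bytes:
    """Stand-in for reading change records from a source transaction log."""
    return "\n".join(change_records).encode("utf-8")


def send_to_hub(payload: bytes, encrypt=None) -> bytes:
    """Source side: compress the captured changes, optionally encrypt them."""
    data = zlib.compress(payload)
    return encrypt(data) if encrypt else data


def integrate(payload: bytes, decrypt=None) -> list[str]:
    """Hub side: optionally decrypt, decompress, and apply the changes."""
    data = decrypt(payload) if decrypt else payload
    return zlib.decompress(data).decode("utf-8").split("\n")


changes = ["INSERT row 1", "UPDATE row 2", "DELETE row 3"]
wire = send_to_hub(capture(changes))   # what travels to the hub machine
assert integrate(wire) == changes      # target receives the same changes
```

The point of the sketch is the ordering: compression (and any encryption) happens before the data leaves the source side, which is what keeps the network hop between agent and hub cheap and secure.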
In Fivetran’s LDP, a location refers to a specific storage space (a database or file storage) that LDP replicates data from (a source location) or to (a target location).
For example, if you have a MySQL database that you want to replicate to a cloud data warehouse like Snowflake Data Cloud, you would create a source location in LDP for that MySQL database and a target location for Snowflake.
Each location in Fivetran has its own set of configuration settings, which you can customize based on the specific database you are replicating to or from.
We can also group locations with similar properties into a Location Group; for example, we can have source location groups and/or target location groups.
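To make the MySQL-to-Snowflake example concrete, the sketch below represents the two locations and a pair of location groups as plain Python dictionaries. The keys and values here are hypothetical illustrations only, not LDP's actual configuration format.

```python
# Hypothetical location definitions; LDP's real configuration format differs.
mysql_src = {
    "name": "mysql_src",
    "class": "mysql",
    "host": "db.internal.example.com",   # placeholder hostname
    "database": "orders",
}
snowflake_tgt = {
    "name": "snowflake_tgt",
    "class": "snowflake",
    "account": "example_account",        # placeholder account
    "database": "ANALYTICS",
}

# Locations with similar properties can be grouped, mirroring the idea of
# source and target Location Groups described above.
location_groups = {
    "SOURCE_DBS": [mysql_src],
    "TARGET_DWH": [snowflake_tgt],
}
```

Each location carries its own connection settings, which is why LDP lets you customize configuration per location rather than per channel.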
In Fivetran’s Local Data Processing (LDP), a channel is a logical grouping of data that you want to process together. A channel connects a source and a target location through a sequence of actions that defines the behavior of your replication.
Channels also define the replication topology in LDP, such as whether the replication is 1-1, 1-many, or many-1, and whether it is unidirectional, bidirectional, or multidirectional. Within channels, various actions are carried out, such as data replication and comparison, scheduling, error handling, and performance tuning.
When you create a channel, you specify the tables or views that belong to it (you can edit this list later). Through channel actions, you can also define rules for how LDP should process the data in that channel, such as how to handle updates, deletes, and conflicts, and how to transform the data.
Each channel should have at least one source location (or location group) and one target location (or location group).
Each channel must also contain two actions: Capture and Integrate.
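The rules above (at least one source and one target location or group, plus the two mandatory Capture and Integrate actions) can be modeled as a small validation sketch. The class and field names here are our own invention, not LDP's API.

```python
from dataclasses import dataclass, field

# Every channel must contain these two actions, per the rule above.
REQUIRED_ACTIONS = {"Capture", "Integrate"}


@dataclass
class Channel:
    name: str
    sources: list                      # source locations or location groups
    targets: list                      # target locations or location groups
    actions: set = field(default_factory=set)

    def validate(self) -> None:
        if not self.sources:
            raise ValueError("channel needs at least one source location/group")
        if not self.targets:
            raise ValueError("channel needs at least one target location/group")
        missing = REQUIRED_ACTIONS - self.actions
        if missing:
            raise ValueError(f"channel is missing required actions: {missing}")


chan = Channel("orders_repl", ["mysql_src"], ["snowflake_tgt"],
               actions={"Capture", "Integrate"})
chan.validate()  # passes: one source, one target, both required actions
```

A channel missing either mandatory action, or with an empty source or target list, would fail this validation, which mirrors the constraints stated above.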
For a detailed understanding of channels and locations, see the official Fivetran documentation.
From the architecture diagram above, we can see LDP has the following components:
Source/Target Location: A location refers to a storage area, such as a database or file storage, where Local Data Processing captures changes (source location) and integrates changes (target location).
High-Volume Agent (HVA): The High-Volume Agent (HVA) is an installation on a remote source or target machine that enables a distributed setup. Acting as a child process for the hub system/machine, the HVA securely connects (using TLS) to the Local Processing Hub System through a designated TCP/IP port number for remote location access. The HVA is compatible with Linux, Unix (Solaris, AIX), and Windows machines.
The HVA sits close to the source or target location. On the source side, it compresses, optionally encrypts, and securely sends data to the hub; on the target side, it decompresses, decrypts, and writes data to the target.
Even though phData and Fivetran both recommend a distributed setup with agents installed on the source and target systems, it is not always possible to install agents there, perhaps due to governance or other constraints.
In such cases, LDP supports an agentless architecture that uses a database connection protocol, such as Oracle TNS, to connect directly to a database location without an intermediate agent. However, agentless LDP can be significantly slower at replication.
For more information on HVA, visit the official documentation.
LDP HUB System
HUB: A hub is a logical entity generated within the LDP Hub System. If multiple hubs are created within the Local Processing Hub System, the hub server will generate a separate Scheduler for each hub.
The hub software can also be configured to act as a High Volume Agent.
HUB Server: The Local Processing Hub Server functions as the access point for any remote connection to the Local Processing Hub System.
When necessary, the hub server process creates child processes, including the Scheduler and Local Data Processing worker (executable). The hub server can manage one or multiple Hubs (logical hubs).
The Local Data Processing process running on the hub machine serves two primary functions: running a lightweight webserver to enable REST API access and managing the Scheduler.
Scheduler: For each logical hub, a separate Scheduler is created and executed by the Local Processing Hub Server.
The Scheduler service, managed by the Local Processing Hub Server, handles replication jobs (such as Capture, Integrate, Refresh, and Compare jobs) that transfer data between source and target locations. The Scheduler initiates the capture and integrate jobs on the hub machine; these jobs connect to the source and target locations to capture or integrate changes.
Jobs: In LDP, a job refers to a process that performs a specific task, such as capturing changes from the source location(s), refreshing data, integrating changes to the target location(s), or comparing data between source and target location(s). These jobs are essential components of the replication process between source and target locations. A job can be in one of the following states in LDP:
ALERTING: The Scheduler changes the job’s state to ALERTING if it fails to run successfully. The job is then retried and eventually runs again.
DISABLED: If a job is disabled, it cannot be resumed. This is different from SUSPENDED.
DONE: This state indicates that the most recent event of the corresponding type (Activate, Refresh, or Compare) has been completed.
ERROR: This state indicates that errors occurred during job execution.
FAILED: This state indicates that the job execution has been canceled.
HANGING: If a job stays in the RUNNING state for too long, it may be marked with the HANGING state. If it finishes successfully, it will become PENDING.
PENDING: This state indicates that the job is yet to be executed.
READY: This state indicates that the job execution is completed.
RETRYING: This state indicates that the job failed and has been restarted at least once during processing.
RUNNING: This state indicates that the job execution is in progress.
SUSPENDED: This state indicates that the job execution is paused and can be unsuspended, which means that it will go into a PENDING or RUNNING state.
WAITING: This state indicates that the job will run at a scheduled time.
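The states above can be collected into a small state model. The transitions sketched here are only the ones the descriptions mention explicitly (HANGING back to PENDING on success, SUSPENDED resuming to PENDING or RUNNING, ALERTING being retried, DISABLED never resuming); LDP's real scheduler has many more.

```python
from enum import Enum


class JobState(Enum):
    ALERTING = "alerting"
    DISABLED = "disabled"
    DONE = "done"
    ERROR = "error"
    FAILED = "failed"
    HANGING = "hanging"
    PENDING = "pending"
    READY = "ready"
    RETRYING = "retrying"
    RUNNING = "running"
    SUSPENDED = "suspended"
    WAITING = "waiting"


# Transitions explicitly described in the text; illustrative only.
ALLOWED = {
    (JobState.HANGING, JobState.PENDING),    # hanging job finished successfully
    (JobState.SUSPENDED, JobState.PENDING),  # unsuspended, waiting to run
    (JobState.SUSPENDED, JobState.RUNNING),  # unsuspended, running again
    (JobState.ALERTING, JobState.RETRYING),  # scheduler retries a failed job
}


def can_transition(src: JobState, dst: JobState) -> bool:
    # DISABLED jobs cannot be resumed, unlike SUSPENDED ones.
    if src is JobState.DISABLED:
        return False
    return (src, dst) in ALLOWED
```

Modeling the states this way makes the DISABLED/SUSPENDED distinction mechanical: a suspended job has valid outgoing transitions, a disabled one has none.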
Repository Database: The repository database consists of a collection of tables that store metadata definitions for the replication process between source and target locations.
The Local Data Processing repository tables, located within the repository database, hold all specifications of replication, including the names of replicated databases, the replication direction, and the list of tables to be replicated. For a complete list of databases that Local Data Processing supports as a repository database, refer to the Repository Database section in the Fivetran official documentation.
LDP Log Files: These files contain messages from scheduled jobs, such as Refresh, Compare, Capture, and Integrate jobs, and provide a detailed view of activities like transport, routing, and integration.
The hub server log contains information about the hub server itself. For each logical hub, a separate set of log files is created and maintained by the Local Processing Hub Server.
These log files can either be accessed directly through the hub machine or through WebUI.
Router Files: Router files are internal files created by the Local Processing Hub Server to track the progress of data replication. A separate set of router files is created and maintained for each logical hub. They contain information such as:
The state of capture and integration jobs
Instructions for replication jobs
Router files can be used to troubleshoot data replication problems. They can also be used to audit data replication activity.
Here are some of the benefits of using router files:
Troubleshooting: Router files can be used to identify the source of data replication problems. By tracking the progress of data replication, you can identify the point at which the replication process failed. This information can be used to troubleshoot the problem and restore the replication process.
Auditing: Router files can be used to audit data replication activity. By tracking the data that has been replicated, you can ensure that the replication process is functioning as expected. This information can also be used to identify any unauthorized changes to the data.
A user can interact with Fivetran Local Data Processing using one of three interfaces:
Web UI: The web UI is a graphical user interface that can be used to configure and operate LDP. It is the primary and most user-friendly way to interact with and explore all features of LDP.
It provides a comprehensive visualization of replication (capture and integrate) along with a dashboard and event log viewer. The web UI can be used on both computer and tablet screens. Currently, LDP WebUI does not support mobile phones.
REST API: The REST API is a set of HTTP endpoints that can be used to programmatically interact with Local Data Processing. The REST API can be used to configure and operate Local Data Processing, as well as to retrieve and manipulate data.
It is mostly used by advanced users or developers to automate the interactions with LDP by writing programs to interact with LDP.
Command Line Interface (CLI): The CLI can be used to configure and operate Local Data Processing from a terminal. It can be accessed directly on the hub machine or from a remote machine: we can use a Linux shell or the Windows command prompt on the hub machine, or on a remote machine provided the LDP CLI is installed there.
Even without an LDP CLI installation on the remote machine, we can still access LDP through the REST API, for example with the curl command.
For more information, visit Fivetran’s official documentation.
Fivetran’s LDP architecture is a versatile and scalable solution for data replication. It can replicate data between on-premises and cloud-based data sources, as well as between different cloud-based data sources. LDP offers a number of benefits, such as:
High performance: LDP can replicate data at high speeds.
Scalability: LDP can be scaled to meet the needs of large organizations.
Reliability: LDP is a reliable solution for data replication.
Security: LDP provides security features to protect data during replication.
LDP can be used for a variety of use cases, including data migration, data consolidation, and data synchronization. If you are looking for a solution for data replication, LDP architecture is a good option to consider. It offers a number of benefits that can help you to improve your data management processes.
If you need further information or help regarding the LDP architecture, feel free to reach out to us, and our Fivetran-certified experts will be happy to assist you in building the best LDP architecture for your requirements.
LDP is based on a distributed architecture for database and file replication and includes all of the modules required to run replication.
Organizations that want to replicate data between transactional databases and file systems can leverage the power of LDP. It also supports real-time change data capture (CDC) for databases, migrations, and analytics.