July 17, 2024

Coalesce Basics: Getting Started & Creating Your First Data Pipeline

By Tushar Kakaiya

Coalesce is becoming a trusted and preferred platform for many of our clients who use the Snowflake AI Data Cloud and want to automate data transformations at scale with a no-code/low-code model.

For this post, we’ll explore the basics of Coalesce, including prerequisites and initial configurations of a Coalesce project. We’ll also take a tour of its interface—including the Build and Deploy tabs—and learn how to add sources and create node types. Finally, we’ll demonstrate how easily you can create a sample data pipeline within Coalesce. 

With that, let’s get started!

Prerequisites

  • A Coalesce account with a project and workspace set up.

  • Integration with GitHub, Bitbucket, or Azure DevOps.

  • Access to create and write tables in your Snowflake account (see the sketch after this list).

  • Snowflake account credentials added to Coalesce.

  • Storage mappings completed in Coalesce, linking it to your Snowflake database and schemas.
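
If your Snowflake role lacks the privileges to create and write tables, a minimal sketch of the grants it might need is below. COALESCE_ROLE, COMPUTE_WH, and the database and schema names are illustrative; substitute your own.

-- Illustrative grants for the role Coalesce will use; all object names
-- here are example names, not values required by Coalesce.
GRANT USAGE ON WAREHOUSE COMPUTE_WH TO ROLE COALESCE_ROLE;
GRANT USAGE ON DATABASE BI_CONSUMPTION TO ROLE COALESCE_ROLE;
GRANT USAGE, CREATE TABLE, CREATE VIEW ON SCHEMA BI_CONSUMPTION.MODELED TO ROLE COALESCE_ROLE;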

How to Create a Coalesce Trial Account

Sign in to any existing Snowflake account, or if you don’t have one yet, create a new Snowflake trial account. Once logged in, navigate to Data Products > Partner Connect in the left-hand navigation bar. Scroll down to locate the Coalesce tile listed under Data Integration, or use the search box to find “Coalesce.” 

After connecting, follow the steps to activate your Coalesce Trial Account.

Initiating Projects and Workspaces: Coalesce Web UI Setup

Once we open the Coalesce Web UI, we land on the Projects screen, where we can see all our projects and their associated workspaces. Projects are a useful way of organizing your development work by a specific purpose or team goal, similar to how folders help you organize documents in Google Drive.

When you create a project, it will ask for the project name and description along with the Git Repository URL. You will need access to a free or existing Git account from your preferred service (GitHub, Bitbucket, GitLab, or Azure DevOps Git) to set up version control. 

For the purposes of this blog, we recommend using GitHub. While it’s possible to skip this step and create a project without a Git repository, leveraging version control maximizes the benefits of Coalesce’s collaborative development capabilities. Additionally, skipping this step restricts you to a single workspace.

If you’re using a GitHub repository URL, you’ll need to add a Git Account by generating an access token and authenticating Coalesce to connect to GitHub. Each project is then linked to its own version control repository, keeping it a self-contained, use-case-specific entity.

Once the project is created, you can establish a Development Workspace mapped to individual branches of the Git Repository. Developers have the flexibility to create workspaces tailored to their assigned project tasks or specific subject areas, optimizing workflow efficiency.

After creating a workspace, launch it to open the standard Coalesce Web Canvas for development. Here, you’ll find the default Build Interface where you’ll develop data pipelines.

Before diving into pipeline construction, you’ll need to configure a few properties required to start building. You can reach the Connection Settings by clicking the Edit Workspace icon near the Launch button.

This action will direct you to the Settings section, where you’ll input the Snowflake Account information (extracted from your Snowflake Account URL).

Once the Snowflake Account details are provided, proceed to the User Credentials Tab to input the necessary credentials for authentication. You can select from a variety of authentication types available. 

With the correct credentials in place, you can test the connection with your Snowflake instance to ensure successful authentication.

Next, navigate to the Build Settings to add the Storage Locations required for your data pipeline configuration. This is available in the left side panel under the Build Interface. 

Storage Locations serve as logical identifiers that we assign, eventually mapping them to Snowflake Databases and Schemas via the Storage Mappings options. In this setup, we’ve established two Storage Locations: one designated for RAW data (the source dataset) and another for TARGET, intended for consumption or business intelligence purposes within our data model.

To link the Storage Locations with their respective Snowflake Database and Schema, navigate to the Edit Workspace once again and access the Storage Mappings tab. Here, you’ll map the designated storage locations with the appropriate Database and Schema within your Snowflake account.

For the RAW Location, we’ve selected the SNOWFLAKE_SAMPLE_DATA database, which is already available within Snowflake, and the TPCH_SF1 schema. Meanwhile, for the TARGET location, we’ve opted to create a new database within Snowflake, named BI_CONSUMPTION, and established a schema under it called MODELED. 
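
If you prefer to create the TARGET objects yourself before mapping, a minimal sketch in Snowflake SQL is below (the names match this walkthrough; adjust them to your environment). The RAW location needs no setup, since SNOWFLAKE_SAMPLE_DATA is typically already available in your account.

-- Create the target database and schema used in this walkthrough.
CREATE DATABASE IF NOT EXISTS BI_CONSUMPTION;
CREATE SCHEMA IF NOT EXISTS BI_CONSUMPTION.MODELED;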

Return to the Browser tab within your Build interface. With that, you’ve successfully completed the initial setup required before building your data pipeline!

Navigating the Coalesce User Interface

Next, let’s learn about the Coalesce User Interface by looking at its basic parts. When you start working in Coalesce, you’ll see the Build Interface and its Canvas. Here, you can create, modify, and publish nodes that transform your data.

Nodes are visual representations of your tables in Snowflake. You can arrange them to show how they relate to each other, which helps you understand how your data moves. Their positions on the canvas and their connections to other nodes form a Directed Acyclic Graph (DAG), which depicts your data pipeline and the relationships and dependencies between all the nodes.

Under the Browser tab of the Build interface, you can visualize your node graph using the Graph, Node Grid, and Column Grid views. In the upper right-hand corner, a person-shaped icon lets you manage your account and user settings. The left panel also includes options to create Jobs, which can be built from entire DAGs or from a subset of the nodes in the DAGs you have created.

A Job can be created to refresh only a set of Fact tables so that other tables are not impacted. Right now, this area is empty; we will build out nodes and Graphs in this pane.

Next to the Build Interface is the Deploy Interface. Here, we can push our pipeline to different environments, such as Development, Testing, and Production. In this interface, you can also find the Activity Feed, displaying the history of deployments and refreshes along with the IDs associated with each run.

If no runs have been started in the Coalesce Project, the Activity Feed will appear empty. Additionally, there’s an option to Generate Access Token, which is required for Coalesce CLI-based deployments and refreshes, or API-based refreshes.

How to Add Data Sources in Coalesce

To create your data pipeline, start by adding data sources to your workspace. You can add tables from the Snowflake schemas we configured in the Storage Location Mapping.

You can click on the + button near the Search box to add sources. 

Once you click Add Sources, Coalesce shows all the tables under the Storage Mappings you configured for the Workspace. Select the tables you want to add as Sources in your data pipeline. When you click a table to select it, the right-hand panel shows a preview of 100 sample records.
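
The preview is roughly equivalent to running a limited query against the source table yourself, for example:

-- Roughly what the 100-record preview corresponds to for ORDERS:
SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS LIMIT 100;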

Once you add these sources, you’ll see the Graph with your Source Nodes. Note that they are colored red. This is the Source Node type (there are other built-in Node types in Coalesce, which we’ll discuss in the next section), and each type has its own color to aid visual organization when viewing a Graph.

If you are unable to see the Nodes in the Graph, click the Draw Optimized Graph button (⟳).

In the Nodes Grid, you can find tables listed under the Source Category. Clicking on any of these tables/Nodes will display the column names and their corresponding data types in the grid below.

Understanding Built-In Node Types in Coalesce

Coalesce provides various built-in Node types commonly used in data warehousing architectures for building data models or transformations. Node types control the actual creation and data manipulation of all database objects within your data pipeline. 

Additionally, you can create your own Node types, referred to as User Defined Nodes (UDNs). Node types are defined by Jinja/SQL templates and a YAML Node definition, which offers numerous options for customization in the Node Type Editor.

Below are descriptions of the various Node types and recommendations for their usage.

Source Nodes

Represent tables, external tables, or views containing raw data queried for processing. These are existing tables in Snowflake.

Stage Nodes

Intermediate nodes where data is prepared with business logic. By default, they truncate data before each run and hold only the current batch. Can be materialized as tables or views. Recommended for splitting up complex logic.

Persistent Stage Nodes

Intermediate nodes where data is prepared with business logic, but which persist data across runs rather than truncating it. They require a business key and can track change history, making them suited to incremental loading.

Fact Nodes

Implementation of a Kimball Fact Table, containing measurements or metrics of a business process. Can merge data using a business key or use INSERT statements.

Dimension Nodes

Descriptive Nodes representing dimensions such as time and location, or storing historical data. Require a business key and support Type 1 and Type 2 slowly changing dimensions.

View Nodes

Used for creating SQL views for improved readability in the DAG. Disabled by default but can be enabled in Build Settings > Node Types.

How to Create a Sample Data Pipeline in Coalesce with Basic Transformations

Now we will enhance our initial graph where we added Source Nodes, and create a sample data pipeline using the built-in Node types available in Coalesce.

You can select any of the Source Nodes (or multi-select several by holding the Shift key), then right-click and choose Add Nodes to add various types of Nodes to the next stage of the pipeline. We will create Stage nodes for all eight Sources we have added.

Clicking on any Stage Node opens the Node Editor, where you can perform basic transformations. You can rename columns and change data types, nullability, and default values directly in the displayed grid. You can also add new derived columns by inserting them at the last row. In the Transformation field, you can write SnowSQL-based code for column transformations.
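
For instance, here is a hedged sketch of an expression you might enter in the Transformation field to compute a net price for a derived column; the columns come from TPC-H LINEITEM, and the exact column reference style can vary depending on how your node is set up:

-- Hypothetical expression for a derived net-price column on STG_LINEITEM;
-- the column reference style may differ in your Coalesce node.
"L_EXTENDEDPRICE" * (1 - "L_DISCOUNT")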

At the top right of the screen, press the ellipsis and select Create All from the menu (which executes the DDL in your Snowflake account), then press the Run All button (which performs the DML operations that populate your Stage nodes in your Snowflake account).
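
For a default truncate-reload Stage node, the effect is roughly like the following sketch (illustrative only; Coalesce generates the actual statements):

-- Create All: DDL that builds the table structure (illustrative sketch).
CREATE TABLE IF NOT EXISTS BI_CONSUMPTION.MODELED.STG_ORDERS
AS SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS WHERE 1 = 0;

-- Run All: DML that repopulates the data; Stage nodes truncate by default.
TRUNCATE TABLE BI_CONSUMPTION.MODELED.STG_ORDERS;
INSERT INTO BI_CONSUMPTION.MODELED.STG_ORDERS
SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS;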

Now, after creating our Stage layer Nodes, we can perform some specific transformations by joining two of our Stage Nodes. Hold the Shift key and select the STG_LINEITEM and STG_ORDERS nodes. Once selected, right-click and select Join Nodes > Stage to create a Stage Node.

By doing this, you create a new Node in your graph, which Coalesce names STG_LINEITEM_ORDERS (the Node type followed by the names of the joined tables, separated by underscores).

You can rename it by clicking the pencil (edit Node name) icon. The Node Editor shows two sections for this Node, which is the result of the join operation. The Mapping grid displays the structure of your Node along with column metadata like transformations, data types, and sources.

On the right-hand side of the Node Editor is the Config section, where you can view and set configurations based on the type of Node you’re using. Select the options that fit your requirements.

Select the Join grid in the Node Editor to define the logic for joining data from multiple Nodes, starting from the pre-generated JOIN statement.

Add the column relationship on the last line of the existing JOIN statement by appending the ON clause shown below:

FROM {{ ref('RAW', 'STG_LINEITEM') }} "STG_LINEITEM"
INNER JOIN {{ ref('RAW', 'STG_ORDERS') }} "STG_ORDERS"
ON "STG_LINEITEM"."L_ORDERKEY" = "STG_ORDERS"."O_ORDERKEY"

To check the accuracy of the syntax, use the Validate Select button in the lower half of the screen. To create your Stage Node in Snowflake (executing the DDL statement), click the Create button in the lower half of the screen. Then click the Run button to populate your table (executing the DML statement).

If there are any errors, validation will fail and a red error symbol will appear next to the STG_LINEITEM_ORDERS Node.

Similarly, you can create the Dimension tables/views for your data models using Dimension Nodes in Coalesce. Right-click on the STG_PART Stage Node and select Dimension Node under the Add Node option, which will create a DIM_PART Dimension Node.

The Dimension Node automatically creates several attributes, like DIM_KEY and some timestamp columns.

If you want to track changes over time and need a Type 2 Dimension table, check the column(s) in the Change tracking grid (under the Config options) and click the > button to add them as tracking columns.
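
Once the Type 2 Dimension is created and run, you can inspect the history kept for each business key. A hypothetical sketch is below; P_RETAILPRICE stands in for whatever column you chose to track, and the system column names are placeholders, so check your DIM_PART node for the names Coalesce actually generates:

-- Hypothetical inspection of Type 2 history per part; the SYSTEM_* column
-- names are placeholders for the system columns your node generates.
SELECT "P_PARTKEY", "P_RETAILPRICE", "SYSTEM_START_DATE", "SYSTEM_END_DATE"
FROM BI_CONSUMPTION.MODELED.DIM_PART
ORDER BY "P_PARTKEY", "SYSTEM_START_DATE";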

Conclusion

And that’s a wrap! We hope this blog has helped you build data pipelines efficiently with Coalesce.

Need further assistance with Coalesce?

Our experts are here to help! Whether it’s setup, implementation, or modernizing data transformation, phData can support your business in creating robust data transformations, enhancing pipelines, and managing complex data analyses using best practices.

FAQs

Do I need prior experience to build data pipelines in Coalesce?

While Coalesce offers useful GUI features, it’s beneficial to have experience in data warehousing and modeling to build data pipelines in Coalesce. Proficiency in SQL, familiarity with Jinja syntax, and experience with any of the popular version control tools are also recommended. Last but not least, experience with Snowflake is essential, as the tool is specifically designed for Snowflake.

Am I limited to the default Node types in Coalesce?

No. Although the default Nodes in Coalesce cover most pipelining needs, Coalesce also offers User Defined Nodes for customization. These allow you to create your own node types and options, leveraging Snowflake’s capabilities to implement customized features. For example, you can create a node in Coalesce that applies Data Masking policies to mask PII data, and then build a pipeline using this custom node.
