Ultimate Guide to Data Lineage Directly in Snowflake

In the world of data-driven organizations, understanding the origin, movement, and transformations of data is paramount. Data lineage plays a crucial role in providing transparency, ensuring data integrity, and enabling informed decision-making.

While traditional methods of tracking data lineage often involve manual documentation and complex processes, the Snowflake Data Cloud offers a powerful and streamlined solution. With its built-in data lineage capabilities, Snowflake allows organizations to directly capture, visualize, and analyze data lineage within its cloud-native environment.

In this blog, we will explore achieving data lineage information directly in Snowflake, exploring its key features and advantages, and how it empowers organizations to gain comprehensive visibility into their data ecosystem.

What is Data Lineage?

Data lineage refers to the ability to trace the origin, movement, and transformation of data throughout its lifecycle. It involves understanding the path of data from its source systems, through various processes and transformations, to its final destination. Data lineage provides a detailed understanding of how data is generated, captured, modified, and utilized.

Data lineage is essential for several reasons:

Data Governance – Data lineage enables organizations to track data usage, ensure compliance with regulations, and understand the impact of data changes.

Data Integrity and Trust – Data lineage helps identify potential data quality issues, troubleshoot data discrepancies, and ensure data integrity.

Impact Analysis – When changes occur in data sources, schemas, or transformations, data lineage helps assess the potential impact on downstream systems, reports, or analytics.

Regulatory Compliance – Many regulations, such as GDPR, HIPAA, and SOX, require organizations to have visibility into data movement, processing, and access. Data lineage helps organizations demonstrate compliance by providing an audit trail of data activities.

Data Lifecycle Management – Data lineage assists in managing the lifecycle of data assets. It helps organizations identify unused or redundant data, optimize data storage, and make informed decisions regarding data retention and archiving.

Traditional Approaches to Data Lineage

Acquiring data lineage can be a difficult task. Traditional methods for tracking data lineage typically involve manual documentation and reliance on stakeholders’ knowledge. This would include reading documentation created by stakeholders, database schema and ETL documentation, interviewing stakeholders for further information, or manually tracing a piece of data from source to destination. 

The obvious major issue with this approach is the many points of failure due to human error. Processes could go undocumented, stakeholders that originally requested the data could no longer be with the company, or processes could be so complex that it would take many hours of work to trace data. A more efficient approach is needed to provide accurate data lineage information.

Why Snowflake is Ideal for Data Lineage

Snowflake is a cloud-native data warehousing platform that separates computing and storage, allowing for automatic scaling and pay-per-use pricing. Unlike traditional data warehousing solutions, Snowflake brings critical features like Data Sharing, Snowpipe, Streams, Time-Travel, and native data lineage capabilities to the enterprise data architecture space.

Snowflake automatically captures and maintains data lineage information without requiring manual intervention. This data is available to Snowflake users through a special view called ACCESS_HISTORY, which holds the history of tables, views, and columns for up to one year. This allows users to simply create a query on an object as granular as a column and view the history instead of having to trace it manually.

Leveraging Snowflake's Data Lineage Features

The best part of Snowflake’s data lineage feature is that it’s implemented automatically by Snowflake. There is no manual intervention required. The ACCESS_HISTORY view is created within the SNOWFLAKE database in the ACCOUNT_USAGE schema. (note: ACCESS_HISTORY view is only available for Enterprise accounts and above)

The view contains eight columns:

QUERY_ID: An internal, system-generated identifier for the SQL statement. This value is also mentioned in the QUERY_HISTORY view.

QUERY_START_TIME: The start time of the query in UTC.

USER_NAME: The user who issued the query.

DIRECT_OBJECTS_ACCESSED: A JSON array of data objects such as user-defined functions (i.e., UDFs and UDTFs), stored procedures, tables, views, and columns directly named in the query explicitly or through shortcuts such as using an asterisk (i.e., *).

BASE_OBJECTS_ACCESSED: A JSON array of all base data objects to execute a query, including columns, external functions, UDFs, and stored procedures.

OBJECTS_MODIFIED: A JSON array that specifies the objects that were associated with a write operation in the query.

OBJECTS_MODIFIED_BY_DDL: Specifies the DDL operation on a database, schema, table, view, and column.

POLICIES_REFERENCED: Specifies information about the enforced masking policy set on the column and the enforced row access policy set on the table, including policies set on intermediate objects or columns.

To learn more in-depth details about this view, read the Snowflake documentation here.

Example of How ACCESS_HISTORY Works

First, I created a test table called DATA_LINEAGE_TEST:

Next, I perform some operations on the table that we’ll be able to see in the ACCESS_HISTORY view later:

Now, since the columns in ACCESS_HISTORY are in JSON format, we’ll need to pull only the data we need from the JSON arrays. The following query returns the history of actions taken on the DATA_LINEAGE_TEST table we performed:

Results:

As you can see, all five tasks are returned in the order they were performed. Taking it one step further, if you require the actual queries performed, you can join the ACCESS_HISTORY view to the QUERY_HISTORY view (within the same schema):

These queries are the building blocks of what you need to view the data lineage of any data within Snowflake. From here, you can create dashboards in a programming language or the BI tool of your choice for ease of use by your engineering team.

Benefits of Data Lineage

With data ecosystems becoming increasingly larger and more complex, data lineage is more crucial than ever. It has become a cornerstone of data management, and utilizing data lineage has several signigicant benefits, including:

Data Quality

Data quality issues can arise in any data warehousing system, including Snowflake. However, using data lineage can help determine quickly and efficiently what caused a data quality issue, helping your engineers fix the problem.

Improved Data Governance

Data lineage in Snowflake helps organizations establish and enforce data governance

policies, standards, and controls. By tracking data lineage, organizations can ensure compliance with regulatory requirements, monitor data quality, and enforce data access and security protocols.

Impact Analysis and Change Management

Utilizing Data lineage in Snowflake enables organizations to conduct impact analysis when making changes to data structures, transformations, or processes. It helps identify the downstream systems, reports, or analyses that may be affected by changes, enabling organizations to plan and manage changes effectively while minimizing risks and disruptions.

Closing

By leveraging Snowflake’s automatic lineage capture and granular details, organizations can gain a comprehensive understanding of their data flow, transformations, and dependencies. Snowflake’s data lineage capabilities empower stakeholders to perform impact analysis, trace data origins, and navigate complex data landscapes with ease.

With Snowflake’s data lineage directly integrated into the platform, organizations can streamline their data governance processes, enhance collaboration among teams, and improve overall data quality. They can confidently comply with regulatory requirements, optimize data operations, and unlock the full potential of their data assets. All with the power of Snowflake’s native data lineage features.

Have a question about data lineage? Need help to succeed with Snowflake? phData can help! Reach out today with your most pressing Snowflake questions.

Yes. There are many data lineage tools available that work with Snowflake. OpenLineage, for example, is a plugin that takes the data from the ACCESS_HISTORY Snowflake view and visualizes it into Directed Acyclic Graph (DAG) for easier viewing, whereas some ETL tools such as Data Build Tool (dbt) have data lineage and DAG visualization capabilities built right in.

Unfortunately, no, Snowflake’s data lineage capabilities only allow for capturing lineage within Snowflake. An external tool, as mentioned above, would need to be implemented to provide such functionality.

Data Coach is our premium analytics training program with one-on-one coaching from renowned experts.

Accelerate and automate your data projects with the phData Toolkit