August 24, 2021

What is Data Modeling and How Do I Choose the Right One?

By Nitin Mule

Building a successful data management solution requires making several correct choices in technology, architecture, and design. Modern cloud-based data platforms like the Snowflake Data Cloud can address most of your technological needs, but you still need to ensure that the design and structure of your data complement the technology you have chosen and meet your business requirements effectively.

Data modeling is an important part of the design process that ensures the effectiveness of your data platform, and if you’re in the process of selecting a data model for your business, you’ve come to the right place. 

In this post, we’ll examine what a data model is and why it’s so important to choose the right model for your data management initiatives. From pros and cons to a comprehensive ratings chart, this blog will ultimately give you more confidence to select and implement the best data model for your business.

What is Data Modeling?

At a high level, data modeling is the process of visualizing and representing data for storage, so that it can be loaded, retrieved, and analyzed efficiently. The model is a conceptual representation of the data, the relationships between data elements, and the rules that govern them. The modeling itself can use diagrams, symbols, or text to represent the data and the way it interrelates.

Because of the structure that data modeling imposes upon data, the process of data modeling subsequently increases consistency in naming, rules, semantics, and security – while also improving data analytics.
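To make this concrete, here is a minimal sketch of how a simple model, two entities, one relationship, and a few rules, might be expressed as Snowflake DDL. The customer and orders tables are hypothetical; note that Snowflake records primary and foreign key constraints as metadata for modeling purposes but does not enforce them (only NOT NULL is enforced).

```sql
-- Hypothetical model: two entities, one relationship, and basic rules
CREATE TABLE customer (
    customer_id   NUMBER       NOT NULL,  -- rule: every customer is uniquely identified
    customer_name VARCHAR(200) NOT NULL,
    CONSTRAINT pk_customer PRIMARY KEY (customer_id)
);

CREATE TABLE orders (
    order_id     NUMBER NOT NULL,
    customer_id  NUMBER NOT NULL,         -- relationship: each order belongs to one customer
    order_date   DATE   NOT NULL,
    order_amount NUMBER(12,2),
    CONSTRAINT pk_orders PRIMARY KEY (order_id),
    CONSTRAINT fk_orders_customer
        FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
);
```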

Why is Data Modeling Important?

Any modern, data-driven business generates vast amounts of data across its operations. Because business functions, systems, and processes differ, this data must be properly consolidated, cleansed of noise, and transformed to enable meaningful analytics. A data modeling exercise is therefore necessary to structure the data consistently and store it in a format usable for a variety of purposes.

In addition, an effective data model gives a Data Warehouse a solid foundation to sustain growing data volumes and to adjust easily to the addition or deletion of data entities.

Changing Needs of Data Lifecycle

The data modeling process varies across the stages of the Data Lifecycle. The requirements for managing data differ based on how the data is generated, stored, and retrieved, and a good data model should address the considerations of the specific stage it is being implemented for. At a high level, a typical Data Lifecycle looks like this:

[Diagram: a typical Data Lifecycle in four stages: Creation, Warehousing, Analytics, and Archival]

What is the Most Important Consideration in Data Modeling?

There are several aspects to consider when choosing the right data model, and their importance varies based on the stage of the Data Lifecycle you are designing for:

- Speed and frequency of data creation and modification
- Speed of data retrieval
- ACID (Atomicity, Consistency, Isolation, Durability) properties
- Business scope
- Access to the lowest grain of data

There can be other factors too, but those above most strongly influence the choice of the right data model.

Rating the Importance of Each Factor at Each Stage of the Data Lifecycle:


Speed and Frequency of Data Creation and Modification

- Creation: Requires fast writing speeds to ensure fast execution of transactions. Data is generated by end-users or automated systems.
- Warehousing: Requires moderately fast writing speeds for the larger volume of data that needs to be stored consistently.
- Analytics: Requires moderately fast writing speeds. Aggregations may be needed for efficient reporting.
- Archival: Slower speeds are acceptable. Successful execution of the archival process is more important than speed.

Speed of Data Retrieval

- Creation: Data may be retrieved as soon as it is written, at the same granularity at which it was inserted, with minimal transformations.
- Warehousing: Data retrieval is needed to generate smaller datasets periodically and may require transformations and aggregations.
- Analytics: Data retrieval is needed for reporting and dashboarding. To ensure a better end-user experience, retrievals should be fast, and repetitive transformations and aggregations should be built into the data storage/data model.
- Archival: Data retrieval is needed only in exceptional circumstances (audits, disaster recovery, etc.). Consistency with live datasets is more important than retrieval speed.

ACID Properties

- Creation: Data creation is part of transactions involving multiple steps, so compliance with ACID properties is critical to ensure transactional consistency.
- Warehousing: Compliance with all ACID properties is not critical, but the dataset is expected to be consistent before and after each batch or micro-batch execution.
- Analytics: Same as Warehousing; batch-level consistency matters more than full ACID compliance.
- Archival: Same as Warehousing; batch-level consistency matters more than full ACID compliance.

Business Scope

- Creation: Focused on a specific business activity.
- Warehousing: Should cover multiple business functions or an entire enterprise.
- Analytics: Focused on specific reporting or AI/ML requirements.
- Archival: Scope depends on the Data Lifecycle stage being archived (Creation, Warehousing, or Analytics).

Access to Lowest Grain of Data

- Creation: In most cases, if a system generates data at its lowest granularity, access to that same level of detail is also required.
- Warehousing: Historical data may need to be stored at different granularities; access to the lowest level of data is important.
- Analytics: Historical data may need to be stored at different granularities; access to the lowest level of data may not be important.
- Archival: Historical data may need to be stored at different granularities; access to the lowest level of data may not be important.

Breaking these down: the Creation stage is generally modeled in Third Normal Form (3NF), which is optimized for transaction speed, while the Archival layer is highly dependent on the use case and storage medium. A platform like Snowflake does not yet focus on these two stages, so for the remainder of this blog we will focus on the Warehousing and Analytics stages, as these use cases greatly influence how data is consumed by modern analytics-oriented applications and AI/ML algorithms.

Depending on the factors discussed above, two data modeling approaches are particularly suitable and widely used as industry best practices: Data Vault and Star Schema.

What is a Data Vault?

Data Vault is a database modeling method designed to provide long-term historical storage of data coming in from multiple operational systems. It addresses concerns such as auditing, data lineage, loading speed, and resilience to change, and it emphasizes the need to trace where all the data in the database came from: every row in a Data Vault must be accompanied by record source and load date attributes, enabling an auditor to trace values back to the source. The method was developed by Daniel (Dan) Linstedt in 2000, and he describes the model as follows:

“The Data Vault Model is detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between the 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise.”

[Diagram: the components of a Data Vault model]

Components of Data Vault:

- Hubs: Tables that hold the unique business keys of core business entities (e.g., Customer, Order), along with load date and record source attributes.
- Links: Tables that record the relationships between the business keys in two or more Hubs.
- Satellites: Tables that store the descriptive attributes of a Hub or Link, along with their full history.
- Reference Tables: Standalone tables for commonly used reference data that does not warrant its own Hub.
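As an illustration, here is a minimal sketch of what these components might look like in Snowflake DDL, assuming hypothetical Customer and Order entities:

```sql
-- Hypothetical Data Vault structures for a Customer entity
CREATE TABLE hub_customer (
    customer_hk   BINARY(20),     -- hash of the business key
    customer_id   VARCHAR(50),    -- business key from the source system
    load_date     TIMESTAMP_NTZ,  -- mandatory audit attribute
    record_source VARCHAR(100)    -- mandatory audit attribute
);

CREATE TABLE sat_customer_details (
    customer_hk    BINARY(20),    -- points back to the hub
    load_date      TIMESTAMP_NTZ, -- each change adds a new satellite row
    record_source  VARCHAR(100),
    customer_name  VARCHAR(200),  -- descriptive attributes live here
    customer_email VARCHAR(200)
);

CREATE TABLE link_customer_order (
    link_hk       BINARY(20),     -- hash of the combined business keys
    customer_hk   BINARY(20),     -- relationship between the Customer hub...
    order_hk      BINARY(20),     -- ...and the Order hub
    load_date     TIMESTAMP_NTZ,
    record_source VARCHAR(100)
);
```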

Data Vault Pros and Cons:

- Pros: Full auditability, since every row carries its record source and load date; resilience to change, because new Hubs, Links, and Satellites can be added without reworking existing tables; and fast, highly parallelizable loading.
- Cons: A large number of tables and joins, and a structure that is not directly consumable by end-users or BI tools, so a presentation layer (typically a Star Schema) is usually built on top.

What is a Star Schema?

The Star Schema data model is widely used to build data warehouses and dimensional data marts. It consists of one or more fact tables referencing any number of dimension tables. The name comes from the shape of its diagrammatic representation, with dimensions distributed around a central fact table.

[Diagram: a star schema, with dimension tables arranged around a central fact table]

Components of Star Schema:

- Fact Tables: Central tables that store measurable, quantitative data (e.g., sales amounts and quantities), along with foreign keys to the surrounding dimensions.
- Dimension Tables: Descriptive tables (e.g., Customer, Product, Date) that provide the context in which the facts are analyzed.
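A minimal sketch of these components in Snowflake DDL, using a hypothetical sales example:

```sql
-- Hypothetical star schema: one fact table surrounded by dimensions
CREATE TABLE dim_customer (
    customer_key  NUMBER IDENTITY,  -- surrogate key
    customer_id   VARCHAR(50),      -- natural key from the source system
    customer_name VARCHAR(200),
    region        VARCHAR(100)
);

CREATE TABLE dim_date (
    date_key       NUMBER,          -- e.g., 20210824
    calendar_date  DATE,
    fiscal_quarter VARCHAR(10)
);

CREATE TABLE fact_sales (
    customer_key NUMBER,            -- foreign keys reference the dimensions
    date_key     NUMBER,
    quantity     NUMBER,
    sales_amount NUMBER(12,2)       -- measures to be aggregated in queries
);
```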

Star Schema Pros and Cons:

- Pros: A simple, intuitive structure; fast query performance for aggregations; and direct consumability by BI and reporting tools.
- Cons: Denormalized dimensions introduce some data redundancy; the model is less adaptable to enterprise-wide scope changes than a Data Vault; and each schema is typically limited to a specific analytical focus.

How to Model Data in Modern, Cloud-Based Database Platforms Like Snowflake

The Snowflake platform was built from the ground up to address the scalability and performance issues of traditional database systems. Snowflake customers do not have to guess capacity requirements or procure and maintain expensive hardware. This is accomplished mainly by separating compute and storage, which lets a modern warehouse scale for performance (compute) and for capacity (storage) independently.
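For example, separate compute warehouses can be created, sized, and resized independently of the data they query. The sketch below uses hypothetical load_wh and bi_wh warehouses:

```sql
-- Compute scales independently of storage: a small warehouse for ELT loads
-- and a larger one for BI queries can run against the same stored data.
CREATE WAREHOUSE load_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND   = 60       -- suspend after 60 idle seconds to save credits
    AUTO_RESUME    = TRUE;

CREATE WAREHOUSE bi_wh
    WAREHOUSE_SIZE = 'LARGE'
    AUTO_SUSPEND   = 300
    AUTO_RESUME    = TRUE;

-- Resizing compute requires no data movement:
ALTER WAREHOUSE bi_wh SET WAREHOUSE_SIZE = 'XLARGE';
```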

In addition, Snowflake provides data types like VARIANT to store semi-structured data. For modern data sources like IoT devices and big data storage (Data Lakes), VARIANT can hold data in its native format, such as JSON, which helps speed up ingestion into either data model, Data Vault or Star Schema. For example, a JSON document can be stored in a Star Schema dimension table as a VARIANT column; that column can later evolve into multiple columns or separate dimension tables, depending on the relationships between the JSON attributes.
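Here is a minimal sketch of that pattern, using a hypothetical dim_device dimension: JSON lands in a VARIANT column, attributes are queried with path notation, and a frequently used attribute is later promoted to its own column.

```sql
-- Hypothetical example: landing JSON in a dimension as a VARIANT column
CREATE TABLE dim_device (
    device_key  NUMBER IDENTITY,
    raw_payload VARIANT            -- stores the JSON document in native form
);

INSERT INTO dim_device (raw_payload)
SELECT PARSE_JSON('{"device_id": "D-100", "type": "sensor", "location": {"site": "plant-1"}}');

-- Attributes can be queried directly with path notation...
SELECT raw_payload:device_id::STRING     AS device_id,
       raw_payload:location.site::STRING AS site
FROM   dim_device;

-- ...and later promoted to first-class columns as the model evolves
ALTER TABLE dim_device ADD COLUMN device_type STRING;
UPDATE dim_device SET device_type = raw_payload:type::STRING;
```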

Along with these inherent platform benefits, it is important to implement the right data model, one that complements Snowflake by enabling effective ETL/ELT, reducing unwanted redundancy, and making optimal use of Snowflake artifacts like compute warehouses and databases.

Both Data Vault and Star Schema Data Models can be implemented in Snowflake, depending on the scope and objectives of the exercise. A Data Vault model is suitable for enterprise-wide or multi-department initiatives, where the objective is to build a Warehouse to store historical data and ensure consistent linkage between data entities in different departments. A Star Schema model is suitable for smaller initiatives focused on specific Analytical needs, which may require data to be structured at different granular levels. 

Data Vault in Snowflake:

- Should be used to build a larger, enterprise-wide Data Warehouse.
- Departments can define ownership of Hubs and Satellites, and Compute Warehouses can be designed according to those definitions.

Star Schema in Snowflake:

- Should be used to build smaller, focused Data Marts for one or a few departments.
- Star Schemas can utilize conformed dimensions, which are defined once and used across multiple Data Marts. This imparts consistency in business definitions across departments and prevents the formation of Data Silos. (For example, a Sales Data Mart and a Finance Data Mart can share a Customer dimension, leading to a uniform definition of the Customer entity across the two business functions.)
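A minimal sketch of the conformed-dimension pattern, using hypothetical shared, sales_mart, and finance_mart schemas:

```sql
-- One conformed dimension shared by two data marts, so "Customer" means
-- the same thing in Sales and Finance reporting.
CREATE SCHEMA IF NOT EXISTS shared;
CREATE SCHEMA IF NOT EXISTS sales_mart;
CREATE SCHEMA IF NOT EXISTS finance_mart;

CREATE TABLE shared.dim_customer (
    customer_key  NUMBER IDENTITY,
    customer_id   VARCHAR(50),
    customer_name VARCHAR(200)
);

CREATE TABLE sales_mart.fact_orders (
    customer_key NUMBER,        -- references shared.dim_customer
    order_amount NUMBER(12,2)
);

CREATE TABLE finance_mart.fact_invoices (
    customer_key   NUMBER,      -- the same conformed key
    invoice_amount NUMBER(12,2)
);
```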

To implement either of these Data Models in Snowflake, follow these next steps.

Step 1: Identify the scope and purpose of the data modeling exercise. A larger, enterprise-wide effort to store and consolidate data across source systems calls for a Data Vault model; a data model serving reporting or AI/ML applications calls for a Star Schema.

Step 2: If a traditional Data Warehouse already exists (on-premises or on another cloud platform), redesign the existing data model as a Data Vault or Star Schema based on that scope and purpose.

Step 3: To migrate the existing Data Warehouse to a Data Vault model in Snowflake, you can use either a Lift and Shift approach or an Iterative Migration approach.

In a Lift and Shift approach, the traditional Data Warehouse is re-platformed on Snowflake without any changes. Once it is live, you can begin mapping existing data entities to the components of the Data Vault (Hubs, Satellites, Links, Reference Tables); Snowflake allows consistent access to both the existing and the new warehouse at the same time. Alternatively, in an Iterative Migration approach, you define the Data Vault model in Snowflake while the traditional Data Warehouse keeps functioning in its current state; Snowflake hosts only the Data Vault model, which continues to evolve as more data entities are migrated.
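As a simple illustration of this side-by-side access, assuming hypothetical legacy_dw and raw_vault databases, a single query can span the old and new models:

```sql
-- During a Lift and Shift, the re-platformed legacy warehouse and the new
-- Data Vault can live in separate databases and be queried together.
CREATE DATABASE IF NOT EXISTS legacy_dw;   -- re-platformed as-is
CREATE DATABASE IF NOT EXISTS raw_vault;   -- new Data Vault model

SELECT l.customer_id,
       h.load_date
FROM   legacy_dw.public.dim_customer AS l
JOIN   raw_vault.public.hub_customer AS h
       ON l.customer_id = h.customer_id;   -- cross-database join
```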

Step 4: Migrating Star Schema models can follow the same approaches as Data Vault. In addition to designing the data entities specific to a Data Mart, you also need to consider how different Data Marts link together, to ensure data consolidation across analytical use cases. This is best achieved by deriving the Star Schema models from the underlying Data Vault model.
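One common way to do this is to expose star schema dimensions as views over the Data Vault. A minimal sketch, reusing the hypothetical Customer Hub and Satellite from earlier:

```sql
-- Deriving a star schema dimension from the underlying Data Vault, so every
-- data mart reads from the same source of truth.
CREATE OR REPLACE VIEW sales_mart.dim_customer AS
SELECT h.customer_id,
       s.customer_name,
       s.customer_email
FROM   raw_vault.public.hub_customer         AS h
JOIN   raw_vault.public.sat_customer_details AS s
       ON s.customer_hk = h.customer_hk
QUALIFY ROW_NUMBER() OVER (
            PARTITION BY h.customer_hk
            ORDER BY s.load_date DESC) = 1;  -- latest satellite row per customer
```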

Step 5: An organization may also have a Data Lake storing vast amounts of unstructured or semi-structured data on big data platforms like Hadoop. Migration from these sources into Snowflake follows a process similar to that for traditional Data Warehouses, but additional effort is needed to identify the relationships between data entities, data integrity requirements, and data ownership.

Step 6: You can also build both Data Vault and Star Schema models in Snowflake from the ground up. In such cases, an iterative approach to the Data Vault is usually most effective: because a Data Vault is not directly usable by end-users, you can build independent Star Schema models in parallel and re-align them to the Data Vault once it is live.

In Closing

While this blog is not exhaustive, it hopefully will shed some light on your path to choosing the right data model for your enterprise. If you have any lingering questions or need additional help designing, building, and operationalizing your data product, we’d love to chat!
