January 3, 2022

DataOps: What Is It, Core Principles, and Tools For Implementation

By Nick Goble

When building a successful company, it’s critical to have a strategy around how you build and scale your business from a technology and data perspective.  Your business likely has competitors that are trying to beat you to market, technology is constantly evolving, and so are your customers.  Software engineering practices define how to reliably and effectively build software and data products, delivering value faster to your customers.

In this post, we will explore the complexities involved with software engineering with a focus on data engineering and data operations (DataOps).  We’ll work through the different facets of taking your data and extracting business value with the same rigor and process companies apply to product development.

How Impactful is Your Data?

It’s nearly impossible to have a successful business without quality data.  Every business is faced with things like:

  • Where do we invest our money to increase revenue?

  • Where are there more sales opportunities?

  • How do we grow our business without losing existing customers?

  • How do we minimize costs?

  • How do I know which tools to leverage?

In order to make an informed decision on any of these questions, you need data!  The larger your organization is, the harder it is to answer all these questions.  Why is that?

If you’re selling a small handful of products to a small group of customers, it’s easier to have a handle on what your users want and have visibility into the usage of those products.  However, this doesn’t translate at scale.  If you have hundreds of thousands of users across a variety of products, you need more tooling and processes at your disposal that enable a thoughtful and scalable approach.

This requires you to establish a data strategy to ensure your data is working for you, not against you.

What’s Data Strategy?

A data strategy is an evolving set of tools, processes, rules, and regulations that define how a company collects, stores, transforms, manages, shares, and utilizes data.  This data may or may not be owned by the company itself and frequently requires multiple layers of manipulation to form a cohesive product or strategy.  A data strategy involves both a technical and a non-technical perspective.

From a non-technical perspective, your data strategy needs to include processes and rules around things like how business requirements are gathered and stored, where and how data is sourced, and what regulations need to be applied to the data to remain compliant.

From a technical perspective, your data strategy needs to include processes and rules about how data is manipulated, where and how data is stored, and how privileges to that data are managed.  You also need to be concerned with things like availability, durability, consistency, cost, and how to iterate on a data product.

For a deeper dive into how your organization can develop a more practical and focused data strategy framework, we highly recommend checking out our free guide:
How to Build an Actionable Data Strategy Framework

Roadmapping The Data Strategy Journey

The way that data products are created and data is analyzed changes and matures over time with the business.  Most companies begin by using Microsoft Excel, downloading CSV files from a variety of sources in order to clean data, perform analytics, and generate reports.  If you’ve ever done this, you’re likely familiar with vlookups, pivot tables, and trying to keep multiple tabs’ worth of data straight.  This is a manual and error-prone process, but one that requires the least amount of complexity.
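The lookup at the heart of a VLOOKUP is just a key-based join.  A minimal Python sketch of the same enrichment (all customer and order data here is hypothetical):

```python
# A VLOOKUP is essentially a key-based join: enrich each order row with the
# customer name found in a lookup table. All data here is hypothetical.
customers = {"C001": "Acme Corp", "C002": "Globex"}

orders = [
    {"order_id": 1, "customer_id": "C001", "amount": 250.0},
    {"order_id": 2, "customer_id": "C003", "amount": 99.0},
]

for row in orders:
    # .get() mirrors VLOOKUP's #N/A behavior when no match is found
    row["customer_name"] = customers.get(row["customer_id"], "#N/A")
```

Unlike the spreadsheet version, this logic can be version-controlled, tested, and re-run on fresh data without manual copying.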

This makes you work for your data instead of your data working for you.

If you’re in the boat of using Microsoft Excel, you’ll quickly realize that you may need more advanced tooling and processes.  This commonly introduces:

  • Database or Data Warehouse
  • API/EDI Integrations
  • ETL software
  • Business intelligence tooling

By leveraging off-the-shelf tooling, your company separates disciplines by technology.  Instead of you manually working the data, software and data products extract it and unlock the ability to generate insights.  However, as your business scales and you’re confronted with data from more sources (or provide more data to your customers), the need for data transformation, preparation, and analysis grows beyond just reporting.

The separate disciplines begin to slow down due to the complexity and dependencies that have accumulated over time.  Teams lose the business context behind the goals of the data products.  The tools do not fully support software engineering principles, and the following questions begin to take root:

  • How do I introduce new functionality without impacting existing users?
  • How do I ensure customers aren’t impacted by changes or new functionality?
  • How can I manage millions of reports?
  • How do I know where this data came from or how it’s being used?
  • How do I maintain all my data pipelines?
  • How do I recreate the environment and data sets from scratch?
  • How do I manage changes and review the outcomes?
  • How do I build confidence and trust in the data products I create?
  • How do I validate that my data is accurate and complete?
  • How do I manipulate my data to gain additional insights and empower my customers?

How Do Software Engineering Principles Solve This?

Software Engineering principles apply to the complete data product lifecycle. There is no silver bullet or one piece of technology that will solve these issues. The whole development process needs to be examined. While there are a variety of definitions and steps for the software development lifecycle (SDLC), for the sake of this post we’ll define the steps as the following:

  1. Concept

  2. Inception

  3. Iteration

  4. Release

  5. Maintenance

  6. Retirement

Within each of these steps, there’s significant complexity, and many companies don’t define how they’re going to approach these before building their products.  However, defining how each of these is going to be accomplished is core to any software engineering product/project.

Product Development

During the concept and inception phases, one of the core roles needed to be successful is a product owner.  The product owner is responsible for envisioning what the product/project will look like and accomplish, and then determining the team members and funding involved.  The product owner needs to frequently re-evaluate how the product is meeting customer needs, gather customer feedback, and introduce updates to the product based on that feedback.  A successful product owner is consistently involved in the prioritization, communication, and delivery of the product’s requirements.

However, how the work is executed and how requirements are gathered varies.  Historically, two frameworks have been the most common: waterfall and agile.

Waterfall Framework 

Within a waterfall framework, all the requirements and deliverables are defined upfront.  All the customer research, features, and business requirements determine how the work is broken down into chunks and executed.  These chunks of work are executed until the project is complete without any changes during the development lifecycle.

The problem with this is that technology is ever-evolving and business needs frequently change.  A waterfall project is strict in nature and doesn’t allow for pivoting based on external factors.  You may end up building a car for your customers when they really need a bicycle.  This may lead to overspending and to customers not using your product.

Agile Framework

Agile development has become the modern standard for software development and phData’s recommended framework for managing work and delivery of data products.  While there’s still an initial concept and inception of the team, the framework works on deliverables in a smaller time frame.  This increases speed-to-market as less up-front analysis is required and allows you to get your product in the hands of customers for feedback.

The incremental changes encouraged by the agile methodology have dramatically changed how we build software and data products.

Why Is This Challenging?

For many businesses, adopting a full software engineering approach with frameworks like agile can be daunting.  However, almost every business operates within an agile framework in some capacity.  Your business likely started with a target customer and product in mind, and then you adapted along the way. 

 If something is painful to accomplish or there’s an issue with your product(s), most businesses will seek out a solution.  This could be finding a better accounting solution, hiring a lawyer instead of paying a law firm, or expanding to a larger physical location.  These are easier to solve as the pros and cons are much simpler to calculate.

When it comes to software engineering, changes can be harder to quantify.  If you add a feature, how do you know that customers will use it?  When is it more profitable to use method A vs. method B?  How do I ensure that I can make changes to my application without impacting users and how do I handle disaster scenarios where my system is unavailable?   How do I know that my data is accurate and complete?

The ability to answer these questions is crucial for businesses but comes with a cost.  

  • You have to have talented engineers who can design and implement systems to ensure that your product remains highly available.
  • You need a method to seamlessly introduce changes, which costs more money than simply copying and pasting your build code to a server.  
  • You have to convince executives, using existing data/resources, that the changes to the business from a technology perspective will drive future value or business.  
  • You need additional resources to monitor and validate that your system is working as designed.  
  • You need resources available to triage if something happens and build out disaster recovery implementations to failover in the case of an outage.  
  • You may need to build another data center or work with a cloud provider to have another set of infrastructure to failover to.

All of these things are expensive but crucial to being successful.

What Are The Components of a DataOps Strategy?

So far we’ve talked about software engineering principles in broad strokes, but what does that look like specifically with data engineering and a DataOps strategy?  How do I reliably build data products?

We will break each of these down individually, but at a high level:

  • Source control management
  • Infrastructure as code
  • Build/Deploy strategy
  • Continuous integration/delivery
  • Data quality validation
  • Workflow management
  • Data modeling
  • Monitoring and logging
  • Alerting
  • Business continuity

You may not need all of these to be successful, but the more you can tackle, the stronger your software engineering practice will be.  Each of these addresses a core functionality that integrates with the incremental development and maintenance structures in your SDLC.

Source Control Management

Many proofs of concept (POCs) start with somebody building a script or application on their own computer.  This script or application may then be deployed to a server and used by customers.  While this works to some degree for a team with one engineer, it carries a lot of risk and doesn’t scale.

Consider the following scenarios:

  • The computer fails and all the source code is lost
  • A bug is introduced and you want to re-deploy the previously stable code
  • You have multiple team members working on the project
  • You want the ability to view incremental changes or revert a specific change
  • You want the ability to review changes before they’re introduced to your codebase
  • You want to be able to save (tag) versions of your application

This is exactly what version control aims to solve.

Let’s look at some examples of version control.

Subversion (SVN)

While we do not recommend using it today, SVN was historically the default choice for version control for a very long time and is still used by some companies.  Now part of the Apache Foundation, it was originally developed by CollabNet, Inc. and uses a centralized repository where all files and history are stored on a central server.

Changes (commits) are introduced directly to the centralized repository by each developer.  To separate work, branches are used to add functionality without affecting the main codebase.  These are merged into the main branch after validation occurs against the branch.  The major downside of SVN is that a developer must be connected to the central server to commit work or make changes to the repository.  It’s also much more limited in how you can collaborate with other developers.

Git

Git (not to be confused with Github) is similar to Subversion but consists of remote and local repositories.  This means you have a central repository (which typically is deployed to an environment) and a local repository (which is used for local development).  The biggest gain with using Git over Subversion is that your developer’s branching and tagging can be separate from the central repository.  Developers are free to do whatever they want in their local repository and once they’re ready, they can introduce changes to the remote repository.  This allows each developer to work with their own local repository without affecting each other.

The other major advantage is offline access.  You’re free to manipulate your local repository however you see fit and push those changes to the remote repository when you’ve reconnected to the internet.  With SVN, you wouldn’t be able to commit any changes or create any branches without internet access.

Git is by far the most popular source control management platform currently available and is our recommendation.

The rest of this section will focus on Git specifically.

Advantages of Source Control Management

Now that we’ve talked about a couple different tools for managing your codebase, let’s talk about some of the specific advantages of using version control.

Pull Requests

One of the biggest challenges when you have a team of developers working on a project is silos of knowledge.  This is when a particular developer or set of developers has deep knowledge of a particular area, but might not be the one executing any updates or changes to that same area.

When a developer wants to introduce changes to the remote repository, they copy the remote repository to their local machine and make changes to the code base.  When the developer is finished working on that particular area, they’ll create a pull request back to the remote repository.  This allows the team to audit the changes and request any modifications via what’s known as a code review.  This allows knowledge to be shared within your team and ensures best practices are being followed.

Many platforms like Github and Bitbucket (which use Git) also have the ability to perform actions against a pull request or changes to a branch.  This could be running tests, audit checks for compliance/security purposes, and ensuring coding standards via linters.

Change Management

When changes are being introduced to your repository, you want control over where that code belongs and what is used for deployments.  It’s also important to know the delta between your current codebase and your deployed codebase.  This tells you which changes are new and which already exist.  Many companies handle this via feature branches: a separate branch used for introducing a feature or set of features that, once finished and validated, is merged into the main branch.

Alternatively, you may have already merged some code into your main branch and have identified that your application is having issues.  Version control provides the ability to revert or back-out changes, allowing you to confidently fail back to a previous version of your product.

Challenges with Source Control Management

While the pros heavily outweigh the cons, it is important to talk about the challenges associated with version control.

First off, you have to define a branching and tagging strategy which takes time to do well.  How are you going to be introducing changes?  How do those changes get released?  How do we know when to merge between branches?  How do we ensure the code is validated before we’ve released it?

Also, you need developers who know how to properly manage your repositories.  This is simpler with SVN, but SVN offers less flexibility and carries more risk.  With Git, you have full control over the remote repositories without having to worry about local development (if set up correctly).  There’s additional complexity if you use forks and keep multiple copies of your remote repository in different locations.  You also have to consider merge strategies (merge commit vs. squash) in order to revert changes confidently.

There’s also potentially additional cost.  If you’re hosting your version control on a server your company owns, this will likely be minimal.  Many companies go with cloud providers to ensure availability, reduce infrastructure management, and reduce their security footprint.

phData has supported and implemented several strategies in the space and can work with you on what works best for your engineering and environment.

Infrastructure as Code

Within your DataOps strategy, one of the key components has to be consistency, reliability, and performance.  Not only should your applications and data be reliable and consistent for consumers, but the pieces that execute your code must be as well.  In the event of an outage, it’s important to be able to understand what changed within your technology suite and remediate issues promptly.  Time is money, right?

Infrastructure as Code (IaC) is an important piece of the puzzle.  This is the process of having your infrastructure defined inside of templates or configurations that are then deployed via scripts or services to your hosting provider.  This could be on-premises servers or cloud providers.  Most frequently IaC is used in conjunction with cloud providers or cloud services.

If your company uses AWS, you likely have at least two AWS accounts provisioned for your environments.  To keep it simple, let’s say we have one account for our development environment and one for our production environment.  Without IaC, developers either use the AWS console or the command line interface (CLI) to provision or modify resources inside those accounts.  It’s very common for resources and applications to be stood up in the development account; then the developers promote the application to production and, BOOM, suddenly nothing in production works.  What happened?  Somebody added a required permission or resource somewhere, and it was missed due to human error.

Alternatively, what frequently happens is that it’s difficult to audit which resources were added by whom and whether they’re still needed.  Maybe an EC2 instance or an IAM role was created in AWS, and then the project was abandoned.

Instead, we want to define our resources in code, create pull requests and perform our security and audit checks, and then deploy our new infrastructure based on our build/deploy pipelines we previously built for our application code.  This introduces consistency in how we not only introduce changes to our applications but also our infrastructure.

We can even combine these resources into a single repository to ensure that our required infrastructure is deployed when we deploy changes to our code!

There are many tools that can be used for this.  Let’s take a look at a few of them.

Terraform

Terraform is among the most utilized tools on the market for IaC.  This configuration-driven tool is made by HashiCorp and supported by more than 1,000 providers, such as:

  • AWS
  • Azure
  • Google Cloud
  • Oracle
  • Alibaba
  • Okta
  • Kubernetes

As you can see, there’s support for all the major cloud providers and the various auxiliary tooling that enterprises frequently leverage.  This is very important, as it enables your enterprise to define resources consistently regardless of the technology or vendor.  Having one consistent pattern for defining resources complements the information architecture practices within your DataOps strategy.

Terraform provides the ability to create reusable modules within your configuration that take in parameters.  For example, let’s say you want to ensure that whenever an EC2 instance is provisioned on AWS, a CloudWatch alarm is created as well to alert if CPU usage goes over a certain percentage.  You could also limit which instance sizes developers are allowed to use, or set default properties that you want every EC2 instance to have.
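As a sketch, such a module might look like the following (the permitted instance sizes, alarm threshold, and resource names are assumptions for illustration, not a prescription):

```hcl
# Hypothetical reusable module: an EC2 instance that always ships with a
# paired CloudWatch CPU alarm, and only allows approved instance sizes.
variable "ami_id" {
  type = string
}

variable "instance_type" {
  type    = string
  default = "t3.micro"

  validation {
    condition     = contains(["t3.micro", "t3.small"], var.instance_type)
    error_message = "Only t3.micro and t3.small are permitted."
  }
}

resource "aws_instance" "this" {
  ami           = var.ami_id
  instance_type = var.instance_type
}

resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "${aws_instance.this.id}-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 80

  dimensions = {
    InstanceId = aws_instance.this.id
  }
}
```

Any team that instantiates this module gets the alarm and the size guardrails for free, rather than remembering to add them by hand.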

AWS CloudFormation

If you’re only planning on using AWS, CloudFormation is another common option for describing and deploying resources into an AWS account.  It also uses a config file approach (JSON/YAML) and can take input parameters similar to Terraform.  Unlike Terraform, though, CloudFormation is free to use.

Instead of modules, CloudFormation has the concept of nested stacks.  In other words, you can create reusable pieces of CloudFormation that you inject into other CloudFormation config files.
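A minimal sketch of a nested stack (the template URL and parameter names here are hypothetical):

```yaml
Resources:
  NetworkStack:
    Type: AWS::CloudFormation::Stack
    Properties:
      # Reusable network template maintained in its own file and uploaded to S3
      TemplateURL: https://s3.amazonaws.com/example-templates/network.yaml
      Parameters:
        VpcCidr: 10.0.0.0/16
```

The parent stack passes parameters down to the nested template, so one vetted template can be reused across many stacks.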

Since CloudFormation is native to AWS, you aren’t reliant on a third party to maintain mappings from configuration properties to AWS resources.  However, you lose some functionality Terraform provides that AWS doesn’t support, such as viewing stack differences in pull requests before merging them.  Instead, you have to create a change set and then apply those changes after reviewing them within AWS itself.

Cloud Foundation

Cloud Foundation was built by phData to provide a CI/CD process around IaC.  It has deep support for CloudFormation but also supports the AWS CDK and Terraform.  Cloud Foundation is built using Sceptre and Troposphere, which allow you to build reusable template files with type checking that generate your CloudFormation templates automatically.  This free tool also gives you the capability to post stack differences to pull requests and to audit changes to your environment in a predictable manner.

With Cloud Foundation, rather than writing your infrastructure-as-code entirely from scratch, you can draw from our AWS CloudFormation library of production-ready “Gold Templates” (ready-to-go infrastructure patterns developed by phData and deployed successfully into production by our customers), as well as build scripts to facilitate a streamlined CI/CD pipeline.

phData Cloud Foundation is dedicated to machine learning and data analytics, with prebuilt stacks for a range of analytical tools, including AWS EMR, Airflow, AWS Redshift, AWS DMS, Snowflake, Databricks, Cloudera Hadoop, and more.

Build/Deploy Strategy

Now that we’ve talked about how to manage your code within repositories, we need to talk about how code is taken from source and built into productionized versions/artifacts.  This build process generally includes:

  • Compilation/Transpilation
  • Minification/Uglification
  • Versioning/Publishing artifacts to a repository
  • Containerization
  • Automation

Let’s dive deeper into each one of these.

Compilation and Transpilation

Compilation is the act of taking a high-level programming language (or tool) and translating that to something a computer can execute.  Depending on the programming language and tooling you have chosen (or are required to use), this will look different and may or may not be required.  A common example of this would be taking a Java project and building that into a jar file.  This jar file can then be executed by the Java runtime on any server with a compatible Java version.

Transpilation is a similar process; however, it translates between programming languages (or versions of the same language).  A transpiler is also known as a source-to-source compiler.  A common example would be translating JavaScript ECMAScript 6 (ES6) into ECMAScript 5 (ES5).  Since older browsers cannot parse ES6, we need an additional step in our build process to translate our code into something the browser can execute.

These steps are not always required.  In many data engineering projects, for example, developers use Python along with templating languages like YAML and Jinja.  Python is a scripting language that (generally) does not require compilation or transpilation to execute on a machine.  You do, however, need to ensure that the Python script you’re running is compatible with the version of Python installed on that machine.
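A common convention is to guard against an incompatible interpreter at the top of the script.  A minimal sketch (the minimum version chosen here is an assumption for a hypothetical pipeline):

```python
import sys

MIN_PYTHON = (3, 7)  # assumed minimum version for this hypothetical pipeline


def check_python_version(current=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the running interpreter meets the minimum version."""
    return tuple(current[:2]) >= minimum


if not check_python_version():
    raise SystemExit("This script requires Python %d.%d or newer" % MIN_PYTHON)
```

Failing fast with a clear message beats a cryptic syntax error halfway through a pipeline run.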

Compilation can also be used for validation purposes.  Tools like dbt and Tram have compile/dry-run capabilities where the output of the application is shown but not executed against.  Tram for example has a dry run capability where you can see all the changes that would be applied to your Snowflake environment if you were to do a full run.  This allows you to audit and review those changes in processes like pull requests.

Minification and Uglification

Minification is the process of reducing the size of your code into the smallest possible size.  This process renames variables, functions, objects, etc. to 1-2 letter short-names and removes any unnecessary whitespace.  This reduces the download size of your resources and increases performance for situations where your code needs to quickly be accessible by a third party.

Uglification is the process of making your code as non-readable as possible to prevent consumers from reproducing your codebase.  This is very common when you’re providing a library for a third party, especially if it’s a paid library.  You don’t want your competition to be able to replicate your work!

Containerization

One of the biggest headaches developers face when building software products is the differences between machines.  You might have heard the phrase “works on my machine!” thrown around as a joke.  If not, imagine a developer builds a product on their local machine.  They think their code is ready for the development environment server, so they deploy it, only to be met with errors.

This is where containerization comes in.  It was made more widely known by Docker; however, Docker wasn’t the first to provide containerization capabilities.  Simply put, containerization is the process of bundling your application and its dependencies into a single resource so the application runs quickly and reliably from one computing environment to another.  This generally includes the code, runtime, system tools, system libraries, and settings required by the application.  The container then runs in (mostly) isolation from the machine executing it to ensure reproducible results and avoid differences between environments.

These containers are also versioned!  You can tag default (latest) containers along with specific versions of your application.  This allows you to quickly switch between deployed versions of your application in your environment and fail-back to previous versions in the event of a failure.
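As an illustrative sketch (the file names are hypothetical), a Docker image for a Python application is defined in a Dockerfile:

```dockerfile
# Hypothetical image definition for a Python data application
FROM python:3.10-slim

WORKDIR /app

# Install pinned dependencies first so this layer caches between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENTRYPOINT ["python", "pipeline.py"]
```

Building with `docker build -t my-app:1.4.0 .` and then tagging that image as `my-app:latest` is one way to get the versioned, switchable deployments described above.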

Publishing and Versioning Build Artifacts

Once your application has been compiled and bundled with all the required libraries and resources, it’s recommended that you publish these artifacts to a remote repository.  These repositories could be entire applications in the case of Docker or could be tooling that you’ve built for a shared library that you use in your codebase.

With data engineering, it’s highly recommended to create shared libraries that tackle consistent functionality that you need across projects. This may include:

  • Data standardization
    • Date/Time formatting
    • Defaulting values
  • Configuration management
  • Connection management
  • Permissions management
  • Logging setup/functions
  • Setup/Teardown tasks

These functions inevitably evolve and the consumers of your code will need to update to handle these changes.  Having a repository that contains and makes available all the versions of your code for consumers is key to being successful with this.
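As a sketch of what such a shared library might contain (function names and the canonical timestamp format are assumptions):

```python
from datetime import datetime, timezone

# Hypothetical shared-library helpers for cross-project data standardization.

ISO_FMT = "%Y-%m-%dT%H:%M:%SZ"


def standardize_timestamp(value: str, source_fmt: str) -> str:
    """Parse a source-system timestamp and re-emit it in one canonical UTC format."""
    parsed = datetime.strptime(value, source_fmt).replace(tzinfo=timezone.utc)
    return parsed.strftime(ISO_FMT)


def default_value(value, fallback):
    """Replace None or empty strings with an agreed-upon fallback."""
    return fallback if value is None or value == "" else value
```

Versioning and publishing this library to an artifact repository lets every pipeline pin the release it was tested against and upgrade on its own schedule.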

Common tools that are used for this are Artifactory, Docker Hub, AWS CodeArtifact, and Azure Artifacts.

Automating Your Build

Ideally, whenever changes are introduced to your Git repository, your code would be built and published to your artifact repository for consumers to immediately use.  Generally speaking, you’ll have a dev build and a production build where dev is used to validate new functionality before providing it to consuming applications.

Depending on what Git provider you’re using, your options will change but generally speaking, these are common pairings:

  • Github -> Github Actions, Azure DevOps, or Jenkins
  • Bitbucket -> Bitbucket Pipelines or Jenkins
  • AWS CodeCommit -> AWS CodeBuild/AWS CodeDeploy

These tools provide hooks that trigger events based on changes in your Git repository and build scripts to automate tasks based on those changes.

For example, say a pull request has been opened against the develop branch of your Git repository.  You want to ensure quality checks and audits are run against that code.  You can create a hook for pull requests in, say, Github Actions that will run whatever checks/audits you determine are required.  Then, when the pull request is merged and a commit is made to your develop branch, you can run the same checks/audits to double-check everything and then build/publish your artifact to an artifact repository.
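A minimal Github Actions workflow sketch for that flow (the job names, commands, and branch are assumptions for illustration):

```yaml
# Hypothetical workflow: run checks on pull requests into develop,
# then build/publish once the merge commit lands on develop.
name: ci

on:
  pull_request:
    branches: [develop]
  push:
    branches: [develop]

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - run: pytest        # unit tests
      - run: flake8 .      # lint/audit checks

  publish:
    if: github.event_name == 'push'
    needs: checks
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: echo "build and publish the artifact here"
```

The `pull_request` trigger covers the review-time checks, while the `push` trigger re-runs them and publishes on merge.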

When deploying dev builds, it’s recommended that you either publish these to a separate artifact repository from production or explicitly name the artifact with a dev prefix/suffix.  This ensures consumers don’t accidentally pull in dev releases.

Continuous Integration and Delivery (CI/CD)

Now that you know how to publish builds to an artifact repository, it’s time to talk about releasing those artifacts to an environment.  This functionality is generally built by a mixture of development operations (DevOps) teams and product teams.  DevOps is responsible for building and setting up new development tools and infrastructure that product teams leverage to build and deploy their products.

Continuous integration is the process of building and testing components as changes are introduced into the repository.  This new code is “integrated” into the existing code with the key goals of finding and addressing bugs quicker, improving software quality, and reducing the time it takes to validate and release new software updates.  In the previous section (Automating Your Build) we discussed some common examples of how this is set up.

Continuous delivery on the other hand is a software development practice where code changes are automatically prepared for a release to production.  This expands the scope of continuous integration to also include another layer of testing before releasing functionality to production.

Best practice is to have automated testing that ensures the new changes haven’t broken existing functionality and runs as expected within the new environment.

This may include UI testing, load testing, integration testing, API reliability testing, etc.

Let’s take a look at some different types of testing and deployment strategies in depth.

Unit Testing

We’ve mentioned unit testing a couple of times, but what exactly is it?  The goal of unit testing is to test individual pieces of code at a very low level.  This generally means ensuring that individual functions and pieces of functionality produce a given output for a given set of inputs.

In a simple unit test, you may have a function that takes two strings as input, concatenates them, and returns the result.  This function might be used to format your logs consistently.  Your unit test would ensure that when this function is called with two predetermined inputs, we get the expected output.  
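
As a sketch of this (the function and test names here are hypothetical, not from any particular codebase), the whole test might look like:

```python
def format_log_message(prefix: str, message: str) -> str:
    """Concatenate a prefix and a message into one log line."""
    return f"{prefix}: {message}"


def test_format_log_message():
    # Two predetermined inputs should always yield the expected output.
    assert format_log_message("orders-service", "order created") == (
        "orders-service: order created"
    )


# A runner such as pytest would normally discover this; called directly here.
test_format_log_message()
```

If the function’s behavior ever changes, this test fails immediately during continuous integration, long before the change reaches production.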

While this is a very simple example, in practice, your unit tests will be more complex.  You might have a function that takes in a SQL query and a connection, then validates the following:

  • Log messages are printed before executing the query
  • Permissions are checked
  • The query is submitted to the SQL connection
  • The result is returned

In cases like this, you’d want to mock or create a fake implementation for the connection and the query engine.  You’d limit the scope of your testing to just the code you have written rather than adding in all the environment resources and third-party libraries.
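
Using Python’s unittest.mock as an illustration (the run_query function and the connection’s methods are hypothetical), a test like this verifies each of those behaviors without touching a real database:

```python
import logging
from unittest.mock import Mock

logger = logging.getLogger(__name__)


def run_query(connection, query: str):
    """Hypothetical function under test: log, check permissions, execute."""
    logger.info("Executing query: %s", query)
    if not connection.has_permission(query):
        raise PermissionError("not authorized to run this query")
    return connection.execute(query)


# The connection is mocked, so the test never touches a real database or
# query engine; only the code we wrote is exercised.
connection = Mock()
connection.has_permission.return_value = True
connection.execute.return_value = [("row1",)]

result = run_query(connection, "SELECT 1")

# Verify the interactions described above rather than real query results.
connection.has_permission.assert_called_once_with("SELECT 1")
connection.execute.assert_called_once_with("SELECT 1")
assert result == [("row1",)]
```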

Integration Testing

Unit tests and integration tests differ in that integration testing happens at a higher level, including environment resources and relevant third-party libraries.  The goal of integration testing is to ensure that your code “integrates” with the environment it’s being tested within.

Let’s say that your code takes some input string and saves it as a file in AWS S3.  Within your unit test, you’d validate that your code is establishing permissions and calling boto3’s put_object function to upload the file to S3.  This test wouldn’t actually upload something to S3, but rather would ensure the function calls the library as expected.  With an integration test, you would allow your code to upload a file to AWS S3, and then you would validate that the file and its contents are what’s expected within AWS S3.
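
Sketching the unit-test side of this with a mocked client (save_to_s3 is a hypothetical wrapper; the keyword arguments shown are boto3’s actual put_object parameters):

```python
from unittest.mock import Mock


def save_to_s3(s3_client, bucket: str, key: str, body: str) -> None:
    """Hypothetical wrapper under test: write a string to S3 via put_object."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))


# Unit test: the boto3 client is replaced with a mock, so nothing is
# actually uploaded to AWS.
s3 = Mock()
save_to_s3(s3, "my-bucket", "path/file.txt", "hello")

# We only assert that the library was called as expected; an integration
# test would instead use a real client and read the object back from S3.
s3.put_object.assert_called_once_with(
    Bucket="my-bucket", Key="path/file.txt", Body=b"hello"
)
```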

Integration testing is critical to ensure that your code hasn’t caused a breaking change within your environment.  Within most enterprises, your code is likely part of an entire chain of processes that need to happen to provide functionality to your customers.  Integration testing allows you to identify if your new code works in that entire chain.  You may have changed a property being sent to another process that causes a breaking change and would have been missed by unit testing.

Integration testing is challenging because you’re going to be submitting test data into your live environment.  Some companies will have a test environment that they run integration tests in before deploying their assets to production; however, this doesn’t test your production environment itself with the new changes.  It’s very common for every environment to have slight differences in configurations, sizing of clusters, and number of resources available.

Therefore it’s highly recommended to run integration tests in your production environment in addition to any lower environments.

You will need to build additional capabilities within your environment to handle this test data and clean up after the tests are complete.  For example, if your application creates orders, you will need to delete these orders and any associated records at the end of your tests.
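
One common pattern is to wrap test-data creation in a construct that guarantees cleanup.  A minimal sketch, with an in-memory dictionary standing in for the real order store (in practice these helpers would call your production API or database):

```python
from contextlib import contextmanager

# In-memory stand-in for the order store; hypothetical helpers below would
# normally call your production API or database.
ORDERS = {}


def create_order(order_id, payload):
    ORDERS[order_id] = payload


def delete_order(order_id):
    ORDERS.pop(order_id, None)


@contextmanager
def temporary_order(order_id, payload):
    """Create a test order and guarantee it is deleted afterwards."""
    create_order(order_id, payload)
    try:
        yield order_id
    finally:
        delete_order(order_id)  # runs even if the test body fails


with temporary_order("it-test-001", {"amount": 10}) as oid:
    assert oid in ORDERS  # test assertions run against live-looking data

assert "it-test-001" not in ORDERS  # cleaned up after the test completes
```

Because the cleanup runs in a finally block, a failed assertion can’t leave test orders behind in production.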

Production Deployment Strategies

Now that we’ve covered different types of testing and the strategies within them, we need to talk about how deployments occur to minimize customer impact.  If users are currently operating your system, or you have services/processes that are running, you can’t kill your application to swap in the new code without interruption.

Depending on what tooling you’re releasing updates for/with, this may not be an issue.  However, generally speaking you have two options.

Update In Place

With an update-in-place strategy, you stop the running application and deploy the new version onto the same infrastructure.  Every deployment strategy has its own pros and cons, so let’s walk through them.

Pros:

  • Simplicity
  • Reduced infrastructure costs

Cons:

  • Either have to run integration tests post-deployment (and have a fail-back strategy) or have to run integration tests separately from your prod instance
  • User/job impact, as you have to take your product offline temporarily
  • Need to ensure your architecture is de-coupled to prevent upstream errors

For many reasons, this architecture is not recommended for enterprises but may be okay for infrequently changing applications that aren’t used by a variety of processes. Instead, you should use blue/green deployments if possible.

Blue/Green Deployments

In this strategy, you instead have two instances of your application in your production environment.  

One is active (all requests are routed to it), and the other is passive (not actively used).  Generally speaking, these are on the same hardware/cluster to avoid any system-specific differences.  You deploy your new code to the passive instance and validate it there.  Once the application has been validated, you update your DNS entries and/or load balancers to point at the instance running your new code.

Pros:

  • Allows you to run integration tests in production before fully releasing
  • Allows you to debug environment issues without impacting users and processes
  • Seamless deployments for consumers

Cons:

  • Complexity
  • Infrastructure costs
  • May not be feasible depending on the technologies involved

Since you have to manage two instances of your application or process, you’ll need to route traffic to them accordingly, and ideally have automation that does all of this for you. There’s a very large upfront cost that goes into supporting this architecture.  

Enterprises frequently struggle with value propositions to executives as the cost of engineering is very clear, but the cost to the business or its customers when the system is down may not be as clear.  However, it’s critical to maintain as much uptime as possible for customers and minimize the risk when making changes to an existing system.

Blue/Green Deployments with Data Products

While the above process may be applicable depending on your situation, it is frequently challenging to implement with data products.  The version of data that people use regularly will live within the same server, so changing connection strings and DNS entries isn’t as practical.  However, we can apply these same principles to data products with a slightly different approach.

Let’s look at one way to accomplish this.

One of our customers needed the ability to export/import data between systems and create data products from this source data.  This required applying transformations and filters to the data for various business units.  The data was being stored in their data lake (AWS S3) and within their data warehouse (AWS Redshift).  Views and materialized views were leveraged to apply the required joins and filters to the data.  Since systems like AWS Redshift don’t natively support slowly changing dimensions or schema changes, we needed the ability to version our data that was live to customers without breaking our data pipelines.

To guarantee that the latest version of a table was used when the views and materialized views were created, we used a templating library (Jinja) that would reference our configuration files for each dataset/table we had and programmatically generate the required data definition language (DDL).  This configuration file (YAML) looks like the following:

schema: my_schema
view: my_view_1
query: |
  SELECT md.column, md.active, ad.active
  FROM {{ MyDataset }} md
    JOIN {{ AnotherDataset }} ad ON md.column = ad.column
  WHERE md.active IS true
    AND ad.active IS true

This configuration is read into our Python code, the variables {{ MyDataset }} and {{ AnotherDataset }} are replaced with values from another configuration file, and the views are recreated with the resulting SQL statement.
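
In the real pipeline, Jinja performs the substitution; as a self-contained sketch of the same step (the versioned table names and view name here are made up), a standard-library regex can stand in for the templating library:

```python
import re

# Hypothetical mapping from dataset placeholders to their current versioned
# tables, as resolved from the second configuration file.
DATASET_VERSIONS = {
    "MyDataset": "my_schema.my_dataset_v2",
    "AnotherDataset": "my_schema.another_dataset_v5",
}

QUERY_TEMPLATE = """\
SELECT md.column, md.active, ad.active
FROM {{ MyDataset }} md
  JOIN {{ AnotherDataset }} ad ON md.column = ad.column
WHERE md.active IS true
  AND ad.active IS true"""


def render(template: str, versions: dict) -> str:
    """Replace {{ Name }} placeholders with versioned table names."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: versions[m.group(1)], template)


ddl = "CREATE OR REPLACE VIEW my_schema.my_view_1 AS\n" + render(
    QUERY_TEMPLATE, DATASET_VERSIONS
)
print(ddl)
```

Pointing the mapping at a different version and re-running the DDL is what makes the fail-back described below a configuration change rather than a data migration.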

This allows us to create new versions of our data sets, populate them with data, validate our data, and then redeploy our views on top of that data to use the new version of our data.  If there’s an issue with the new version, we can easily swap our configurations out and fail back to a previous version.

Data Quality and Validation

This is one of the trickiest parts of a DataOps strategy and requires a lot of input from those responsible for data governance.  We explore data governance in-depth within our blog series, but foundationally there needs to be rules and practices established that define what data quality means to your organization and how your data is tracked/transformed from raw to curated.

While many factors of data governance and data quality will depend on your organization’s approach to this topic, there are some common factors that every organization should consider.

Information Architecture

When organizations start on their data strategy journey, it’s very common that processes, jobs, data structures, and other architectural components are added to existing resources as needed.  These existing resources, such as a database, typically accrue new objects with each new project until an incident occurs.  That is generally when businesses start re-thinking their information architecture, but there’s much more complexity to it than simply which tables go into which database.

At its core, information architecture is centered around how data is:

  • Organized
  • Structured
  • Labeled
  • Discoverable
  • Standardized
  • Formatted

Developers, data scientists, and other consumers need to be able to quickly find, understand, and utilize the data within your system.  Databases, schemas, tables, views, etc. should all be well named and structured in a way that your organization has deemed appropriate for its consumers.  There should be documentation on how these are structured and where to find different areas of data within your organization.

phData recommends Flyway as a tool for managing changes to data objects using engineering principles.  Changes are applied as part of the deployment workflow. 

Data Profiling

When working with data, you’re generally going to have changes to the system flowing in and out.  This may be changes to your applications, data updates from various sources, or event data that is submitted to other entities inside or outside of your organization.

One of the worst things that can happen in a system is missing data.  Maybe you have a bunch of sales data and are missing a $1,000,000 transaction, which causes you to misreport your financials to stakeholders and your stock price to drop.  Maybe you end up sending a customer an incorrect bill.  These situations can cause significant issues within your organization.

If you’re receiving data from another system or company, you need to be able to trust that the data is correct to avoid miscalculations.

This is where data profiling comes into play.

Data profiling is the act of processing, analyzing, and extracting information about your data.  This is frequently done with some level of statistical analysis calculating standard deviations to identify irregularities.

With our two previous examples, how would data profiling help?  Well, if you know that your sales are generally a certain amount each month, you could identify this month’s sales as being outside the bounds of your normal amount and notify stakeholders before people access the data at the end of the month.  This proactive approach to data validation allows you to minimize risks and get ahead of the issue.

If you’re sending or receiving data from another system or company, you can profile the data before you ingest or send it.  If the data appears incorrect you can set it aside for further review and/or alert stakeholders.

Data profiling allows your business to proactively monitor your data quality and notify you of any irregularities.

Data Profiling In Practice

There are many data profiling tools available. For simplicity, we’ll focus on two:  Deequ from AWS and Great Expectations from SuperConductive.

Deequ is an extension of Apache Spark that allows you to write unit tests against your data.  You can establish a set of tests that you want to run on either existing data or new data coming into your system, and automatically set aside and alert on any data that doesn’t adhere to the tests.  Deequ can also perform anomaly detection based on absolute or relative amounts of change to your data values.

Great Expectations runs against either Apache Spark or a Pandas dataframe and likewise allows you to write tests against your data.  It also auto-generates documentation about your assertions in a nice readable format and can integrate with Apache Airflow to enable data validation within your data pipelines.

Both of these tools follow the same general workflow:

  1. Ingest a set of data
  2. Run existing rules or generate rules against your data
  3. Calculate statistics and compare those to the expected results
  4. Store the results of step 3
  5. Take action based on the result
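
A minimal sketch of this kind of statistical check (the sales figures are illustrative, and real tools compute far richer profiles):

```python
import statistics

# Illustrative monthly sales totals from prior months.
history = [98_000, 102_000, 101_000, 99_500, 100_500, 97_000]

mean = statistics.mean(history)
stdev = statistics.stdev(history)


def is_anomalous(value, n_stdevs=3.0):
    """Flag values more than n_stdevs standard deviations from the mean."""
    return abs(value - mean) > n_stdevs * stdev


assert not is_anomalous(100_000)  # a typical month passes quietly
assert is_anomalous(1_100_000)    # an outlier is flagged for review
```

Tools like Deequ and Great Expectations wrap this idea in declarative rules, scheduling, and alerting so you don’t maintain the statistics yourself.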

Batch vs Streaming

Another complexity to consider when talking about data validation is the frequency at which data arrives into your system.  You may have a batch process that’s ingesting a CSV file on a cadence like once a day, or you may be ingesting data in a streaming architecture from a Kafka cluster.  The way you validate your data will be greatly influenced by your situation and architecture. 

We recommend identifying sync points that align with your information architecture so that data currency expectations are known at a governance level. This helps drive requirements and determines the right validation at the right time for the data.

Workflow Management

Now that we’ve covered how to build and deploy application code and infrastructure, let’s talk about actually developing data products.  When moving data between systems, there are several different ways that data can flow through a system.  There’s often a need to orchestrate how that data is loaded and manipulated in order to build a final data product.  Next, there has to be some sort of cadence to how often we need to run those orchestrations against our data.  These are all critical to ensuring that your product is consumable and available to consumers.

Batch vs Streaming

We briefly touched on this earlier, but let’s dive into this a bit more.  One of the first things you need to identify when building a data product/architecture is how frequently you need to consume or update data.  

The two most common approaches for data ingestion and processing are batch and streaming.  There are significant pros and cons to both approaches.  Generally speaking, the closer to real time you want your data to be, the more the complexity goes up.

Usually when talking about batch or streaming, you end up talking about three things:

  • Performance
  • Cost
  • Availability

Let’s dive into how these look within a batch and streaming architecture, and then we’ll talk about some tooling available for these.

Performance and Availability

When talking about performance, it’s important to articulate exactly what is performant.  There are many pieces to a data pipeline and depending on the context, things can take on different meanings.  In this case, we’re talking about data performance or the time that it takes for data to become available for end consumers.

In a streaming architecture, data is processed as fast as possible.  There are many different common streaming architectures such as:

  • Publish/Subscribe or Pub/Sub
  • Message queues
  • RabbitMQ

These all have their own pros and cons which we won’t get into, but they all work to identify individual things that have changed within your system.

Let’s say an order was created or updated.  This event would be published to one of these architectures and then your application would be responsible for consuming it and processing the data.  This allows you to work on small pieces of data and process things as an individual unit of work.

If your data is tightly coupled or processing requires having multiple records, then streaming can become problematic.  If every single order update needs to go and query your database for the full order details or additional details, you can quickly overwhelm your database and need to look into caching options.  This also introduces complexity with how performant your code needs to be when it’s processing the orders.  Too slow and your system becomes backed up and consumers aren’t seeing their data.

Alternatively, a batch process usually focuses on the data processing performance rather than data availability.  For example, let’s say we batch the last hour worth of orders and we need to process that data.  We can quickly identify all the order numbers we need, submit one query to our database with all the order numbers in our batch, and get one response from the database with the required information.  This significantly reduces network traffic, database load, and allows other applications to utilize the resources of your infrastructure without heavy external load.  That is to say, your overall infrastructure performance may be higher by utilizing a batch architecture at the cost of data performance to consumers.

You will need to weigh the pros and cons of each design and decide what is best for your use case.


Cost

As with anything in life, there’s always a cost to the decisions you make.  If you want something quick, the cost of that thing almost always increases.  This is also generally true with a streaming architecture.

Within a streaming architecture, you need the ability to process data at any given moment.  While there are increasingly more serverless options available from various cloud providers, you’re going to need some sort of computing resource that’s always available to process messages from the provider of your choice.  Serverless is convenient as you don’t have to ensure a server or broker is always running, but that convenience comes at a cost.  You will also be making more frequent writes to whatever storage layer you decide to save data into.

With a batch processing approach, you’re only using resources for a set time.  Between batches, those resources can either work on other tasks or can be shut down depending on the situation.  You could also use serverless processing here, but generally speaking, serverless is better for short-running tasks.

Batch processing will likely be cheaper than streaming at the cost of data availability/performance.


Orchestration

A common requirement within a data pipeline is orchestration, or the ability to control what tasks happen in what order.  These could be many different types of tasks, from ingestion to cleanup to data manipulation for building machine learning models.

A common tool that’s used for orchestration is Apache Airflow.  Initially created by Airbnb and later open-sourced via the Apache Software Foundation, this tool allows you to define a directed acyclic graph (DAG) that details out tasks, their dependencies, the order in which they execute, and retry/failure/success functionality at both the task and DAG level.

For example, let’s say you are receiving CSV files from a third party.  You need to ingest that data into various locations and perform some auxiliary cleanup and maintenance tasks on that data.  You could have one massive job that you run that performs all of these actions, but that’s harder to maintain long-term and difficult to de-couple. 

Instead you can break those tasks into smaller chunks and use an orchestration tool to run these tasks sequentially on your system.  One task could be ingesting the CSV files into your data lake and data warehouse, then a second task to process any deleted records, and then a third task to run compaction and vacuuming on your data to increase performance.
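
In Airflow, each of those chunks becomes a task in a DAG with explicit dependencies.  As a library-free sketch of the underlying idea (in practice these would be Airflow operators), a dictionary of tasks and their prerequisites can be executed in dependency order:

```python
# A library-free sketch of DAG-style orchestration; task names mirror the
# CSV-ingestion example above.

def ingest_csv():
    print("ingesting CSV files into the data lake and warehouse")

def process_deletes():
    print("applying deleted records")

def compact_and_vacuum():
    print("running compaction and vacuuming")

# Each task lists the tasks that must complete before it may run.
DAG = {
    "ingest_csv": (ingest_csv, []),
    "process_deletes": (process_deletes, ["ingest_csv"]),
    "compact_and_vacuum": (compact_and_vacuum, ["process_deletes"]),
}

def run(dag):
    """Execute tasks in dependency order, returning the order they ran in."""
    done, order = set(), []
    while len(done) < len(dag):
        for name, (fn, deps) in dag.items():
            if name not in done and all(d in done for d in deps):
                fn()
                done.add(name)
                order.append(name)
    return order

order = run(DAG)
```

What Airflow adds on top of this skeleton is retries, scheduling, failure callbacks, and visibility into each run.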

Building Orchestrated Workflows

When deciding to build an orchestrated workflow, there are many complexities that you have to consider such as:

  • How do I organize my tasks?
  • Which tasks are dependent on previous tasks or previous iterations?
  • Do all of my tasks need to run successfully on each iteration?
  • Is each iteration idempotent?
  • How should I handle failures?
  • Where and how do I capture metrics?
  • When should I use orchestration infrastructure resources vs external resources?
  • How do I authenticate within my orchestrations?

Depending on your business requirements, you may not need the data-update ingestion to succeed in order to run the “ingest deleted keys” task, but you may require it before running the “validate data” task.  When running the ingestion tasks, you’re likely going to use external resources such as AWS Glue to load the data into various target systems.

In a simpler example, you may be able to leverage the Airflow worker and just execute a copy command to AWS Redshift against your data in S3.  You may use the Airflow worker within the data validation task to execute queries against various systems to validate your data.  You may find that you want to do data reconciliation as part of your data validation task and now need to move your compute resources from the Airflow worker to AWS Glue as well.

Even with simple orchestrations with simple goals, it’s important to understand the different options and tradeoffs.

Data Modeling

Specifically concerning data engineering and DataOps, you’ll need to determine a methodology to format, process, and model your data.  The requirements to manage data differ based on how data is generated, stored, and retrieved.  The end goal of data modeling is to illustrate the types of data stored within the system, the relationships between them, the way the data is grouped/organized, and its formats and attributes.

A good data model should address the considerations for the specific stage of the lifecycle it is being implemented for.  At a high level, a typical data lifecycle runs from creation and ingestion through storage, transformation, and consumption to archival or deletion.

There are several aspects to consider when choosing the right data model, and they vary based on the stage of the data lifecycle you are designing for.  These factors are as follows:

  • Speed and frequency of data creation and modification
    • Small amounts of data should be written to disk faster while maintaining consistency
  • Speed of data retrieval
    • Small or large amounts of data retrieval for reporting and analysis
  • ACID properties
    • Atomicity, Consistency, Isolation, and Durability of transactions
  • Business scope
    • One or several departments or business functions 
  • Access to the lowest grain of data
    • Different use-cases for data may require access to the lowest level of detail or various levels of aggregation

You can find more information on data modeling in this blog post.

Monitoring and Logging

Some of the worst things that can happen to a business revolve around unknown failures within their applications and data pipelines.  It’s much easier to address an outage or failure if you notice it before your customers do.  Your customers need to feel confident that you’re providing the service they need; if they’re the first to find issues, it’s going to degrade their trust.

Monitoring and logging is one of the most situational components of a data strategy and will largely depend on what your specific requirements are.  You’ll need to consider:

  • Data colocation and availability
  • Frequency of logging and what is logged
  • Do your logs trigger any monitors or are they used for debugging purposes?
  • Does your company already have an application performance monitor (APM) solution?  Does your solution integrate with this or do you need to build a custom log pipeline?
  • What do you monitor within your data strategy and what triggers alerts?
  • What method of communication are you alerting via (email, Slack, text, etc)?

All of these questions are part of an evolving data strategy.  Frequently, businesses will start with an understanding of what is critical to their enterprise and will build logging and monitoring around critical business functions.  Over time, as incidents inevitably occur, these processes will be expanded upon to try to avoid similar incidents in the future.

What Should I Log?

You know that feeling when you’re in your home at night and the power suddenly goes out?  Your heart is racing and you’re stressed about when the power is going to come back on.  Everything is pitch black, you suddenly forget how your home is structured, and you need to figure out how to get to a light source so you can see. Sound familiar?

This is what debugging a production issue or outage without logs feels like.

The main purpose of logging is to be able to see and track what your system is doing in the event that something happens.  This allows you to see where your program was last running and what data you made visible in your logs in an attempt to “find a light source” or path forward in resolving the issue.

An example of good logging that would help in an outage would be something like the following:

  1. Logged that a function was called
  2. Logged the input to that function
  3. Logged that an exception occurred within that function and the stack trace

This would enable you to understand where your application was when the issue occurred, what the error was, what function it happened within (if you don’t have a stack trace that shows you this), and what your function was called with.  This should give you all the information you need to start attempting to reproduce this issue and find the root cause of the failure.  It should also give you enough information to build a unit test or integration test in your code to avoid this issue going forward.
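
A minimal sketch of this pattern with Python’s logging module (the function and its inputs are hypothetical):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("billing")


def apply_discount(order_total: float, discount_pct: float) -> float:
    # 1. Log that the function was called and 2. log its inputs.
    logger.info("apply_discount called: total=%s pct=%s", order_total, discount_pct)
    try:
        return order_total * (1 - discount_pct / 100)
    except TypeError:
        # 3. Log the exception, including the stack trace, then re-raise.
        logger.exception("apply_discount failed")
        raise


print(apply_discount(100.0, 10))
```

logger.exception automatically appends the stack trace, so a bad input surfaces in the logs with the function name, arguments, and failure point all in one place.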

It’s also important to make a distinction between application logs and infrastructure logs.  The code you’ve deployed will need to have the ability to be triaged in the event of a failure, but it’s common to have infrastructure issues unrelated to your code.  This could be things like connectivity issues, issues with scaling configurations, or changes to permissions.  You’ll want to collect logs for both infrastructure and applications to fully understand any impact to your system.

Where and How Should I Log?

This is going to greatly depend on whether you’re self-hosting, using a cloud provider, what tooling you have available, and whether you want to perform any additional analysis on your logs.  This may include things like AWS Cloudwatch, Azure Monitor with log analytics, an application performance monitor (APM) tool, or log forwarding tooling.

Let’s take a look at some examples.

Docker Containers

We previously talked about containerization and some of the utility and consistency it provides.  But how would one go about collecting logs from a Docker container?

Docker provides in-depth documentation on the different options, but at a high level, you need to determine whether you want to collect logs locally, mount a volume that the logs are written to, forward them to other tooling like AWS CloudWatch or Splunk, or have a logging daemon collect them.  You can configure this either per container or at the Docker daemon level for all containers.

Forwarding the logs to additional tooling is beneficial as you can then collect logs from a variety of sources and perform analytics across multiple applications or platforms.  This may require additional resources to standardize your log formats.

AWS CloudWatch

This is going to be the default answer for where to send logs if working within AWS.  This is a native service provided by AWS that provides a suite of tools for monitoring and observability built for DevOps engineers, developers, site reliability engineers (SREs), and IT managers.  

CloudWatch also collects metrics from your applications and infrastructure that you can use for alerting and management of resources.  This includes things like scaling your cluster out or in when a certain CPU percentage is hit, scaling resources when a queue exceeds a certain size, or alerting when your account’s bill reaches a certain threshold.

CloudWatch can even be used as a central repository for logs sent from on-premises systems or other cloud providers.  There is also support for a variety of third-party tools that can send log messages and metrics from CloudWatch to other services, such as New Relic or Datadog, for further analysis.  You can also stream your metrics to tools such as Apache Kafka.

CloudWatch also gives you native support for creating dashboards from your metrics.  These dashboards give you a quick insight into what’s happening within your organization, identify any issues that need to be addressed, and easily monitor your applications, infrastructure, and costs.

By monitoring your application metrics and logs, you can identify any anomalies in server activity, identify denial-of-service attacks, and have your infrastructure automatically react to changes in usage.

Azure Monitor with Log Analytics

If you’re using Azure infrastructure, then you’re going to likely be using Azure Monitor combined with Log Analytics to collect logs from your applications and infrastructure.  Similar to CloudWatch, Azure Monitor gives you the ability to create dashboards, collect logs from various sources, perform analytics against them, and forward your logs to other locations.

If you’re using Microsoft tooling on-premises, Azure Monitor also gives you the ability to seamlessly integrate logging from your on-premises resources into the cloud via Application Insights. 

Azure Monitor has some nice features on top of what CloudWatch provides.  Within your log analytics, you’re easily able to gain further insight into your usage.  For example, when looking at logs from a load balancer, you’re able to view on a map where your traffic is coming from and the metrics from each location.  This drill-down capability is convenient when trying to address user-specific performance issues.

While AWS CloudWatch has dashboarding capabilities within it, Azure instead has a separate service called Azure Dashboard.  Once you’ve customized your dashboard with whatever metrics and tooling you want, you have the ability to share this dashboard with other users.  You can view the full documentation here.

What Should I Monitor and Alert On?

When building a DataOps strategy, it’s important to define what should be alerted on and at what severity.  If your cloud bill is higher than normal, you likely want to know, but it isn’t as critical as your data pipelines or database being down.  There’s also a variety of ways to handle alerts within your system.

As a general recommendation, your business should identify what is business-critical.  This should be directly correlated to the impact on your business when these alerts happen.  When services have issues, there’s always a scale of impact.  If your company relies on a single database for all your transactions and that database goes down, you’ll want all the bells and whistles going off to quickly alert your developers and business of the issue.

A good starting point for monitoring would be to set up alarms for the following:

  • Health checks to your critical resources
  • Response times for requests made to your system and intra-system
  • Database/Disk space available
  • Cost/budget

Whether you use AWS CloudWatch, Azure Monitor, or an APM, you’ll have the ability to define rules that trigger alerts to be sent out.  Depending on the platform, the implementation for alerting is going to differ.

Business Continuity

Alerts and monitors are crucial for visibility into your applications and infrastructure, but what happens when your infrastructure is unavailable?  A data center could lose all connectivity or catch fire, and suddenly, your business is unable to function.  Your business needs the ability to insulate itself from single points of failure.

There are many ways to solve this problem, and they depend on the risk tolerance of your business.  There are also many pros and cons depending on how your business wants to address it.  You’ll also need to define what your recovery point objective (RPO) and recovery time objective (RTO) are.  These establish how long your business can be down and what data loss there might be when swapping between infrastructure resources.

Let’s walk through some disaster recovery scenarios.


There are a few different scales of outage that can happen at an infrastructure level.  The risks associated with each scenario will need to be weighed against the probability, infrastructure costs, and development costs to implement a solution.  This will also vary depending on whether your business uses on-premises infrastructure, single cloud, or a hybrid cloud strategy for your infrastructure resources.

Single Cloud Provider

Within a single cloud provider, nearly every provider offers capabilities to handle failures at a particular data center or region.  The requirements to implement disaster recovery will depend on what resources your business uses, but let’s assume that your business is using a database that is required to be available as much as possible.  You have a few options for handling a disaster scenario with these resources:

  • Restore from automated backups or snapshots into another availability zone or region
  • Maintain a standby replica in another region and promote it during an outage
  • Run an active-active, multi-region deployment where every region can serve traffic

The implementation will greatly depend on how you’ve decided to host your database, what tooling you’re using, your RPO/RTO, and whether you want to automate this process.

Hybrid Cloud Provider

While many cloud providers have exceptionally high availability for their infrastructure as a whole, you may want to insulate yourself from a cloud provider going down entirely.  This approach has pros and cons:

Pros:

  • Nearly always available
  • Easier to migrate between cloud providers
  • Not reliant on a single provider’s implementation of disaster recovery

Cons:

  • Very expensive
  • Development time is significantly longer, which impacts speed to market
  • Harder to take advantage of cloud provider-specific functionality
  • Disaster recovery must be implemented separately within each cloud provider

In the event of a cloud provider failure under this strategy, you will need the ability to reroute requests to whichever infrastructure is still available.  The load balancer or DNS layer that performs this routing will need to live independently of your selected providers to ensure that requests can still be handled in the event of an outage.
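At its core, that provider-independent routing decision reduces to “send traffic to the first healthy provider in your preference order.”  A minimal sketch of the failover logic; the endpoint URLs are hypothetical placeholders:

```python
# Sketch of provider-independent failover: given an ordered preference
# list of endpoints and their latest health-check results, route
# traffic to the first one that is healthy.

ENDPOINTS = [
    "https://api.primary-cloud.example.com",    # hypothetical primary provider
    "https://api.secondary-cloud.example.com",  # hypothetical secondary provider
]

def pick_endpoint(endpoints: list, health: dict):
    """Return the first endpoint whose last health check passed, else None."""
    for endpoint in endpoints:
        if health.get(endpoint, False):
            return endpoint
    return None  # total outage: nothing healthy to route to

# Primary provider down, secondary healthy -> fail over to the secondary
health = {ENDPOINTS[0]: False, ENDPOINTS[1]: True}
print(pick_endpoint(ENDPOINTS, health))
# → https://api.secondary-cloud.example.com
```

In practice this logic runs inside an independent DNS failover service or global load balancer rather than your own code, with health checks probing each provider on an interval.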

So How Do I Build a DataOps Strategy?

Now that we’ve talked about all the individual components and concerns when building data products, let’s talk about how to put it all together into a cohesive data strategy.  

Every business is going to start from a different technical understanding.  While many of these topics may be trivial for a technology company with a strong technical foundation, other companies will find them instrumental.  With that being said, what does that roadmap look like?

Which core capabilities you need to nail down, and in what order, will depend on your enterprise and the business goals you’re trying to achieve, but the general prescription will look like the following:

  1. Establish a source control system such as Git.  
    • You need the ability to track changes in your system
    • You need the ability to have multiple developers easily able to work together on the same product
  2. Review existing processes and workflows
    • Identify what can be automated or programmatically run
    • Identify how frequently those tasks need to be run
  3. Establish a pattern for monitoring, logging, and alerting
    • You need the ability to track and trace issues within your system and alert on failures
    • You need the ability to be proactive to any issues
  4. Establish a method for software builds and releases
    • This usually starts with manual builds and deployments
    • You’ll want to track what version of your application is currently deployed and have the ability to use that version in local development for debugging purposes

This will enable your business to more reliably and efficiently make changes to your application and monitor for any issues as a result of those changes.  Once you can reliably identify changes and introduce changes to your system, the next step is to set up automated testing to reduce the burden of manual validation.  This should be set up in a build/deployment pipeline to ensure the tests are executed consistently and changes are validated before releasing to users or consumers of your system.

  1. Unit Testing
    • This is easier to implement than integration testing
    • Ensures code changes don’t break existing functionality at a low level
  2. Continuous Integration, Continuous Deployment, and Infrastructure as Code
    • Once you have reliable and tested code, you can more frequently release your code
    • When you release your code, you should also update your infrastructure to match
  3. Integration Testing
    • Ensures code changes don’t break existing functionality at a system level
    • Many companies do this in tandem with unit testing; however, it’s more complex and takes time to do well
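To make the unit-versus-integration distinction concrete for a data pipeline: a unit test pins down one transformation function, while an integration test exercises the path from raw input to final output.  A minimal sketch using plain assertions; the `normalize_revenue` and `run_pipeline` functions are hypothetical examples, not from any real codebase:

```python
# Hypothetical transformation: convert raw revenue strings like "$1,200.50"
# into floats -- the kind of low-level logic a unit test should pin down.

def normalize_revenue(raw: str) -> float:
    return float(raw.replace("$", "").replace(",", ""))

def run_pipeline(rows: list) -> float:
    # Integration-level behavior: the full path from raw rows to a total.
    return sum(normalize_revenue(r) for r in rows)

# Unit test: one function, one behavior
assert normalize_revenue("$1,200.50") == 1200.50

# Integration test: multiple components working together
assert run_pipeline(["$1,200.50", "$99.50"]) == 1300.0

print("all tests passed")
```

In a real build pipeline these assertions would live in a test framework such as pytest and run automatically on every commit, failing the build before a bad change reaches users.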

This then sets your business up for a modern DataOps strategy that can be expanded upon depending on your needs.  You can continue to expand and automate your build and deploy strategy with tooling like linting, standardization, shared libraries, etc.


DataOps.Live: Pulling It All Together

Founded by software engineers, phData has always focused on DataOps principles when building data products.  The founders of DataOps.Live are likewise thought leaders in evangelizing these principles, and they have created a robust product that provides an opinionated solution for developing and operating data products.

phData is partnering with DataOps.Live to accelerate our joint customers’ delivery of data products.  Learn more about this exciting partnership by reaching out to phData today!

In Summary

Building an effective DataOps strategy requires an array of decisions across components, infrastructure, and established patterns.  The choices you make for each component will depend on your individual business needs, capabilities, resources, and budget.

Like most things in technology, there are several ways to accomplish the same tasks, each with its own pros and cons.  You will need to identify the current state and desired state of your business in the short and long term, identify the current gaps, and establish a road map to address these gaps.

If you’re new to many of the topics above or need help establishing your DataOps strategy, phData can help! Our team of seasoned data experts can help you design, build, and operationalize your modern data product.
