What is Data Engineering? Everything You Need to Know in 2021

A picture with data all around with the phData logo in the center

It’s easy to overlook the amount of data that’s being generated every day, from your smartphone and your Zoom calls to your Wi-Fi-connected dishwasher. It is estimated that the world will have created and stored 200 zettabytes of data by the year 2025. While […]

How to Build a Modern Data Platform Utilizing Data Vault

Detailed, yet simple diagram of an example data vault

When looking to build out a new data lake, one of the most important factors is to establish the warehousing architecture that will be used as the foundation for the data platform. While there are several traditional methodologies to consider when establishing a new data lake (from Inmon and Kimball, for example), one alternative presents […]

How Do I Use StreamSets Test Framework?

Picture of several lines of code

StreamSets Test Framework (STF) is a set of Python tools and libraries that enables developers to write integration tests for StreamSets Data Collector, Control Hub, Data Protector, and Transformer. This unique test framework allows you to script tests for pipeline-level functionality, pipeline upgrades, the functionality of individual stages, and much more, depending on your requirements. But the […]

How Does Kubernetes Horizontal Pod Autoscaling Work with Custom Metrics?

A sample workflow for Kubernetes Autoscaling

Kubernetes is a great way to deploy cloud-native applications in the cloud or on-premises. One of the biggest advantages of Kubernetes pod autoscaling is that it automatically scales your application based on demand. This can be extremely helpful when the load an application encounters is variable. Kubernetes has three different types of scaling: Cluster scaling, Vertical […]

How to Know if Your Data Engineering Projects Will Be Successful

Data and analytics platform diagram for successful data engineering projects

To make sure data engineering and analytics projects are successful, not only do you need to pick the right technology and have the right people; you also must have the discipline to apply software engineering best practices. What sort of practices am I talking about? Make sure your requirements are clear and communicated to all […]

Trials and Tribulations Preventing Silent Data Loss

Finding silent data loss between an RDBMS and Azure Data Lake

A few weeks ago, a colleague from another project team reached out to me for help with an urgent issue. Errors and irregularities had crept into reports tied to one of their StreamSets pipelines, to the point that the business had complained about the data quality. But when the developers inspected the pipeline, they had […]