October 15, 2020

Trials and Tribulations Preventing Silent Data Loss

By Ella Chen

A few weeks ago, a colleague from another project team reached out to me for help with an urgent issue. Errors and irregularities had crept into reports tied to one of their StreamSets pipelines, to the point that the business had complained about the data quality.

But when the developers inspected the pipeline, they had no clue as to what data was lost, how it got lost, or why this happened.

A Silent Killer

The team was dealing with an issue known as “Silent Data Loss”. This happens when a data pipeline appears to run perfectly, without any exceptions or other obvious indications of error, yet some of the data turns out to be missing. It’s a common enough problem, but one that’s often extremely difficult to diagnose and troubleshoot. It might take days or even months for users to discover the loss. Other times, they might not notice the issue at all.

So, how do we go about preventing silent data loss? We start by finding where the loss is taking place.

Uncovering the Symptoms

In this particular case, StreamSets was used as the ingestion tool for the original data pipeline. It pulled data from an RDBMS using the JDBC Query Consumer origin and wrote to a storage system (in this case, ADLS).

Finding data loss between an RDBMS and Azure Data Lake
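
For reference, a JDBC Query Consumer running in incremental mode is driven by a SQL query that references the last stored offset. The original pipeline’s query isn’t reproduced in this post, but it would have followed roughly this shape (the table and column names below are placeholders, not the project’s actual names):

    SELECT *
    FROM   source_table                -- placeholder table name
    WHERE  offset_col > ${OFFSET}      -- ${OFFSET} is the last offset value StreamSets stored
    ORDER BY offset_col                -- incremental mode expects results ordered by the offset column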

I duplicated the pipeline and changed the destination to a Hive table and HDFS, partly because I lacked access to ADLS, but also because checking the data with Hive/Impala queries tends to be more straightforward.

Recreating the pipeline in Hive and HDFS

To reproduce the issue, I used the exact same JDBC query to pull data, then kept the pipeline running for a few days. It ran perfectly, without a single error message, and the row count of the Hive table grew every day. Next, I used Spark as an alternate ingestion tool to pull a snapshot of the source table for comparison.

Using Spark as a new ingestion tool to see what data fell out of the pipeline.
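
Once the Spark snapshot has landed in its own table, the comparison itself can be a straightforward Hive/Impala query. A sketch along these lines (the table and key names are illustrative, not the project’s real names) returns the rows that exist in the source snapshot but never made it through the pipeline:

    -- Rows present in the fresh source snapshot but absent from the pipeline-loaded table
    SELECT s.*
    FROM   source_snapshot s
    LEFT JOIN pipeline_target t
           ON s.id = t.id              -- join on the table's primary key
    WHERE  t.id IS NULL;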

That’s when I found intermittent data loss between the source table and the Hive table. Every day, a handful of random new records failed to show up in Hive: clear evidence that we were indeed dealing with a case of silent data loss.

Diagnosing the Root Cause

I examined each missing row, hunting for common shared attributes that might give some clue as to the issue’s cause. Could it be related to a specific primary key? A specific user? A specific type of business?

Finally, I found that almost all of the missing records had been updated in the AM hours. That update time was stored in the column “modified_date”, which was used as the offset column to track the incremental ingestion into the pipeline.
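
One way to surface that kind of pattern is to bucket the missing rows by the hour of their “modified_date”. Assuming the rows flagged by the earlier comparison were saved into a table or view (called missing_rows here purely for illustration), a query like the following makes the AM clustering easy to spot:

    -- Count the missing records by the hour of their modified_date
    SELECT hour(modified_date) AS modified_hour,
           count(*)            AS missing_count
    FROM   missing_rows
    GROUP BY hour(modified_date)
    ORDER BY modified_hour;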

This is where the methodical diagnostic process gives way to experience and gut instinct: I had a hunch — even if it was only a wild guess — that there was something wrong with this column.

After inquiring with the source database team about how the value of “modified_date” got inserted into the source table, I identified three relevant processes that had access to the production table:

  • Individual database user entry
  • Another application connection
  • Excel Macro upload

Finding and preventing silent data loss from an RDBMS

In the first scenario, each user would update or insert a record via SQL query, and the “modified_date” value would be generated by the production database. In the second, the application interacted with the production database via JDBC connections and used system-generated timestamps. And in the last, a group of business users would manually collect new records daily in an Excel file, then load them to the production database as a batch.

The “modified_date” column in this last case was pre-populated by the Excel macro, which might not always be in sync with the current timestamp. This resulted in some records having an outdated value of “modified_date” rather than the real timestamp.

Ultimately, since the pipeline could only pull the newest data (i.e., data with a timestamp greater than the last offset value), those out-of-sync records with smaller timestamp values were being silently skipped during the incremental ingestion. And there we had it: the cause of our silent data loss.
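
To make the failure mode concrete, suppose the previous batch left the stored offset at 2020-06-15 14:00:00, and the Excel macro then loads a row whose pre-populated “modified_date” is 2020-06-15 08:30:00 (both values are made up for illustration). Every subsequent incremental query filters on a timestamp that the stale row can never satisfy:

    -- ${OFFSET} resolves to '2020-06-15 14:00:00', the last stored offset
    SELECT *
    FROM   source_table
    WHERE  modified_date > ${OFFSET}   -- '2020-06-15 08:30:00' is not greater, so the row is never pulled
    ORDER BY modified_date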

The Solution: Trigger Logic

After weighing a few possible solutions, we proposed a trigger-based solution to make sure the “modified_date” column value would always be up to date. Whenever records were inserted or updated, the trigger logic would guarantee that the timestamp was set no earlier than the real system time.

The source database team accepted this solution and added a new column, “last_update_timestamp”, populated by trigger logic with the current system datetime. Because this approach had minimal impact on the production model (no change to the original column was required), it was implemented quickly and smoothly.
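
The exact trigger was written by the source database team and depends on their RDBMS, but the general idea can be sketched in Oracle-style syntax (the table and trigger names here are illustrative):

    -- Illustrative sketch only; the actual implementation depends on the source RDBMS
    CREATE OR REPLACE TRIGGER trg_set_last_update_ts
    BEFORE INSERT OR UPDATE ON source_table
    FOR EACH ROW
    BEGIN
      -- Always stamp the row with the database's own clock, regardless of what the client supplied
      :NEW.last_update_timestamp := SYSTIMESTAMP;
    END;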

At last, after switching the incremental ingestion to use “last_update_timestamp” as the new offset column, there was no more silent data loss in our data pipeline.
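
Switching over amounted to pointing the consumer’s query (and its configured offset column) at the new field, roughly along these lines (placeholder names again):

    SELECT *
    FROM   source_table
    WHERE  last_update_timestamp > ${OFFSET}
    ORDER BY last_update_timestamp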

Lessons Learned

Incremental mode is a popular batch ingestion method for big data projects because it pulls only the newest batch of data for next-step processing. Because of this, data gets loaded faster than with a truncate-and-reload approach, which reloads the entire dataset for every batch.

The most important part of incremental ingestion is choosing a good candidate field from the table as the offset column, which is used to determine where to pick up the next batch of data.

Ideally, a primary key column or an indexed column makes a perfect candidate. When the source table lacks such columns, always confirm with the source team which candidate columns might fit. If we don’t completely understand the offset column, there is a real risk of silent data loss.

But what really makes preventing silent data loss so tricky is that there are so many possible causes. It’s often hard to even know where to start looking — which is why it’s always good to have someone you can turn to who’s gone through it all many times before.

Looking for more information on managing your data pipelines?

We know it can be a challenge to maintain the infrastructure, cloudops, optimization, availability, and data protection of your pipelines. To make this easier to manage, phData invested in the Managed Data Platform so you don’t have to. Let us manage your pipelines so you can manage your business.
