November 2, 2020

How to Know if Your Data Engineering Projects Will be Successful

By Jeff Mortimer

To make sure data engineering and analytics projects are successful, not only do you need to pick the right technology and have the right people; you also must have the discipline to apply software engineering best practices. What sort of practices am I talking about?

  1. Make sure your requirements are clear and communicated to all parties
  2. Make sure they’re tested before calling it “done”
  3. Make sure they’re tested on a regular basis

Scenario: Multiple SAP Systems

Let’s take a look at a scenario involving a company that has multiple SAP systems and ingests their data into a data and analytics platform. The company wants to combine the information for reports and financial forecasting. Those SAP systems apply row-level security (RLS), so most employees can see only the data generated by the business unit they belong to.

Data and analytics platform diagram

In the diagram above, let’s say that Alice has been granted access to all the data in all of the SAP systems, Bob can only see data that was generated by business unit 1234, and Lin has not been granted access to any SAP data. Now suppose we need to build a view that aggregates data across business units to give us a company-wide look at financial results.

Before building anything that combines data from the different systems, we need to think about what Alice, Bob, and Lin will end up seeing if they run a query against the table or view. Will they see the same results, or will they see different results due to their access privileges? There’s no right or wrong answer to that question — it depends on what you’re building. But you need to make sure that everyone is on the same page before starting development — the business analyst, the data engineers, the people doing the testing, and the consumers of the product once it’s completed. If you don’t spell it out, chances are that not everyone is on the same page. For instance, a possible implementation might look something like this:

Data and analytics platform diagram

A data engineer (let’s call him Dan) takes advantage of some preexisting views that apply the same row-level security as the source systems. By the way, everyone on Dan’s team has access to all of the data in the SAP systems. He does some testing, thinks it’s done, and announces that this new combined view is available for use. Alice starts using the view, likes what she sees, and builds the results into an interactive visualization. Others start using the visualization… Due to the RLS, Bob’s results are slightly different from Alice’s, but he doesn’t notice. When Lin tries it, nothing shows up. She tells Alice her visualization is broken. Alice says, “It looks good to me.” Lin is frustrated…
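To make the effect concrete, here is a minimal Python sketch of what happens when an aggregate is built on top of views that enforce RLS. The rows, the amounts, the second business unit (5678), and the function names are all made up for illustration; none of them come from a real SAP system or platform.

```python
# Hypothetical rows: (business_unit, amount). Values are invented for illustration.
ROWS = [("1234", 100), ("1234", 50), ("5678", 200)]

# Hypothetical access grants: which business units each user may read.
ACCESS = {
    "alice": {"1234", "5678"},  # access to all data
    "bob": {"1234"},            # one business unit only
    "lin": set(),               # no SAP access at all
}


def rls_view(user):
    """Return only the rows this user is allowed to see (row-level security)."""
    allowed = ACCESS[user]
    return [row for row in ROWS if row[0] in allowed]


def company_total(user):
    """Aggregate on top of the RLS view, the way Dan's combined view does."""
    return sum(amount for _, amount in rls_view(user))


# Same "company-wide" view, evaluated once per identity.
totals = {user: company_total(user) for user in ACCESS}
print(totals)
```

With these made-up numbers, Alice’s total is 350, Bob’s is 150, and Lin’s is 0: one view, three different answers, and no error message to tell anyone why.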

Clear and Communicated Requirements

How can this situation be avoided? Make sure the requirements of your data engineering project are clear and communicated to all parties. Row-level security should be explicitly addressed in all requirements involving systems like this. The requirements should have said either “Results should be the same regardless of the access level of the user” or “Results will differ depending on the access level of the user.”

The requirements in the scenario above were to build a view that aggregates data across business units to give us a company-wide look at financial results. That seems to imply that anyone using the view will get the same results. Don’t make assumptions — spell it out. Say that “Everyone using this view will get the same results, regardless of row-level security.” Maybe Dan didn’t realize this when building the combined view — if he had, he wouldn’t have built the solution by reusing the views with RLS.

Test Before Calling it Done

Once requirements are clear and spelled out, you need to make sure that they’re tested before calling it “done.” The first question is: who’s doing the testing? Obviously, the data engineer should be testing, but it’s likely that they have a blind spot or two. Did they read the requirements? Did they interpret them correctly? Did they check that all the requisite software engineering best practices were followed?

For complex data engineering and analytics projects, it’s almost always better to have someone else do some testing, in addition to the data engineer, before calling it “done.” Whoever that person is, they’re going to need different identities or roles set up to simulate all of the different users we mentioned above: Alice, Bob, and Lin. It’s especially important to test with users that have access to most, but not all, of the data, because partially limited access can lead to slight differences that often get overlooked. You don’t want to introduce errors like that into your analytics.
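As a rough sketch of what testing with multiple identities could look like, the Python below runs the same view as each user and reports anyone whose result differs from a reference user. It assumes the requirement is “results should be the same regardless of access level.” The data, the grants, and every function name here are hypothetical stand-ins for whatever your platform provides.

```python
# Hypothetical data and grants, invented for illustration.
ROWS = [("1234", 100), ("5678", 200)]
ACCESS = {"alice": {"1234", "5678"}, "bob": {"1234"}, "lin": set()}


def view_with_rls(user):
    """Total built on RLS-filtered rows: the result depends on who is asking."""
    return sum(amount for bu, amount in ROWS if bu in ACCESS[user])


def view_without_rls(user):
    """Total built on unfiltered rows: the same answer for every user."""
    return sum(amount for _, amount in ROWS)


def users_with_divergent_results(view, users, reference_user):
    """Run the view as each user; list those whose result differs from the reference."""
    expected = view(reference_user)
    return sorted(user for user in users if view(user) != expected)


# Requirement under test: "Results should be the same regardless of access level."
divergent_rls = users_with_divergent_results(view_with_rls, ACCESS, "alice")
divergent_plain = users_with_divergent_results(view_without_rls, ACCESS, "alice")
print(divergent_rls)    # users who would see something different from Alice
print(divergent_plain)  # empty when every identity gets the same result
```

A check like this only works if the tester can actually run queries as Bob and Lin, which is why the extra identities or roles matter. Testing as Alice alone would never surface the difference.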

Okay, let’s say you’ve spelled out the requirements. And you’ve done the initial testing to make sure all those different identities are working correctly. Awesome! Life is good! What could possibly go wrong?

Scenario: Expose Data Users can Access

Let’s take a look at a different scenario: a view that only exposes data that the user has access to. The requirement: A view that generates a list of general ledger entries across all of the SAP instances. Users will get different results depending on their access. Dan takes that requirement and implements it by reusing the views with RLS:
Data and analytics platform diagram

The new view is tested and everything’s working great! Alice sees all the entries, Bob sees only the entries from business unit 1234, and Lin doesn’t see any of them. Things work as intended. Time passes, and everyone’s happy. Then, as more and more data is loaded into System B, performance starts to degrade. Dan fixes it by materializing the data…

Data and analytics platform diagram for successful data engineering projects

But Dan forgets to implement RLS on the materialized data. So now Bob and Lin are seeing more data than they should be…

Test on a Regular Basis

The final point is to run tests on a regular basis. What once worked is not guaranteed to keep working. Which tests you run, and how often, depends on the situation, but some sort of test suite should be run before promoting any changes into your production environment.
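One way to picture such a suite is a small check that records, per user, which business units they are expected to see, and refuses to promote a change when any identity sees something else. The Python below is only a sketch under that assumption. The ledger entries, the expected-visibility table, and names like `safe_to_promote` are hypothetical, not part of any particular tool.

```python
# Hypothetical general ledger entries; "bu" is the business unit that generated each one.
ENTRIES = [
    {"bu": "1234", "entry": "GL-1"},
    {"bu": "5678", "entry": "GL-2"},
]

# The requirement written down as data: the business units each user should see.
EXPECTED_UNITS = {
    "alice": {"1234", "5678"},
    "bob": {"1234"},
    "lin": set(),
}


def view_with_rls(user):
    """Original view: entries are filtered by the user's access."""
    return [e for e in ENTRIES if e["bu"] in EXPECTED_UNITS[user]]


def view_materialized(user):
    """Materialized copy where RLS was forgotten: every user gets every row."""
    return list(ENTRIES)


def access_failures(view):
    """Run the view as each user and describe any mismatch with the requirement."""
    failures = []
    for user, expected in EXPECTED_UNITS.items():
        seen = {e["bu"] for e in view(user)}
        if seen != expected:
            failures.append(f"{user}: expected {sorted(expected)}, saw {sorted(seen)}")
    return failures


def safe_to_promote(view):
    """Gate for promotion: True only when every identity sees exactly what it should."""
    return not access_failures(view)


print(safe_to_promote(view_with_rls))      # the original view passes
print(access_failures(view_materialized))  # the materialized copy is flagged
```

Run against the materialized version, the check flags both Bob and Lin, which is exactly the regression from the second scenario. Run before every promotion, it would have caught the problem before users did.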

Data engineering and analytics projects are complex, and unexpected problems will of course arise. But by following a few straightforward but oft-overlooked software engineering best practices, namely clarifying requirements and ensuring testing from the outset, you can help ensure that your technology investments ultimately succeed. And if you need a hand with your data engineering projects, the phData team is here to help! Reach out to us at info@phdata.io to get in touch with our experts.
