How to Successfully Implement Data Science at Organizations: Failing and Learning Fast

The field of data science is one of the most popular topics in business. Companies have disrupted stale industries by introducing groundbreaking modeling techniques that have changed the way business is done. Seeing these large wins in data science has driven the race to adopt predictive and prescriptive modeling in business.

In 2019, Gartner estimated that only 20% of data science projects would actually give meaningful outcomes by 2022. YIKES! That’s a pretty bad statistic, especially considering the cost of labor and computing power necessary to build the models in the first place. Unfortunately, this statistic does not surprise me at all. As everyone is racing to incorporate data science into their organizations, many are not taking the right approach.

What Does it Mean to Fail Fast?

The concept of failing fast has its roots in software development. A fail-fast program stops and reports an error as soon as it detects a problem instead of carrying on in a broken state; for example, a job that cannot finish within its allotted time is cut off with an error message. In data science, failing fast means moving forward with an experiment but cutting it off if it does not create enough value in a predetermined amount of time. The goal is to try a variety of experiments in a short period of time, stop development where there is no value, and scale development where there is.
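
As a rough illustration, the fail-fast rule for an experiment can be written as a small go/no-go check. This is only a sketch: the Experiment fields, the threshold, and the three outcomes ("scale", "stop", "continue") are hypothetical names chosen for this example, and how value is measured is left to the team.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    # All fields are hypothetical; pick units that make sense for your business.
    name: str
    weeks_spent: int        # time invested so far
    time_box_weeks: int     # agreed time limit for the experiment
    measured_value: float   # value shown so far (e.g. estimated uplift)
    value_threshold: float  # minimum value needed to justify scaling


def decide(exp: Experiment) -> str:
    """Return 'scale', 'stop', or 'continue' for a time-boxed experiment."""
    if exp.measured_value >= exp.value_threshold:
        return "scale"     # enough value: invest further
    if exp.weeks_spent >= exp.time_box_weeks:
        return "stop"      # time box used up without enough value: fail fast
    return "continue"      # still inside the time box: keep experimenting
```

For example, an experiment that has used its full four-week time box but only reached 0.2 against a threshold of 0.5 would be stopped: decide(Experiment("churn PoC", 4, 4, 0.2, 0.5)) returns "stop".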

Why Do Data Science Projects Fail?

Data science projects can fail for a variety of reasons, but there are a handful of commonalities between failed engagements.

Lack of Data

At its core, data science is the combination of mathematics, computer engineering, and business acumen. The fuel for data science is data, and historical data is the most critical component of a data science project. The biggest reason data science projects fail is bad data. Bad data can be a variety of things: not at the level of granularity necessary, extremely messy or incomplete, not relevant for the current problem, or simply unavailable. Without extensive and relevant data, there is no foundation on which to build a data science solution. You would not build a house without a strong foundation, so do not try to build a data science solution without good data.
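
Some of these problems can be caught with a quick check before any modeling starts. The sketch below covers only two of them, unavailable data and incomplete fields; the function name, the field names, and the 20% default cutoff are assumptions made for this example, and questions of granularity and relevance still need a conversation with the business.

```python
def data_readiness_issues(rows, required_fields, max_missing_share=0.2):
    """Return a list of problems found; an empty list means no red flags.

    rows is a list of dicts (one per record). The 20% default cutoff is an
    arbitrary illustration, not a recommended standard.
    """
    if not rows:
        return ["no data available"]
    issues = []
    for field in required_fields:
        # Count records where the field is absent, None, or an empty string.
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        share = missing / len(rows)
        if share > max_missing_share:
            issues.append(f"{field}: {share:.0%} of rows missing")
    return issues


# Hypothetical weekly sales records for illustration only.
sample = [
    {"sku": "A", "week": "2021-01", "units": 10},
    {"sku": "A", "week": "2021-02", "units": None},
    {"sku": "B", "week": "2021-01", "units": None},
]
issues = data_readiness_issues(sample, ["sku", "week", "units"], max_missing_share=0.5)
```

Here issues contains a single entry flagging the units field, since two of the three sample records have no value for it.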

Lack of Business Support

The best data scientists in the game are extremely talented at translating business requirements into mathematical models and technical requirements. Understanding how a business runs its operations for a particular domain is critical in formulating the problem and figuring out the best approach. While this sounds straightforward, it can ultimately result in a lot of meetings and discussions that can bog down key business stakeholders. If the stakeholders are not fully invested in the process and ready to support the development team, the team tends to make a lot of assumptions that might not be accurate. The lack of buy-in can result in subpar model tuning and performance.

Lack of Technology Support

Once a successful model has been built and fully adopted by the business, the next phase is having the model run on a schedule and update with the most recent data. While this sounds like it would just be flipping a switch, pushing models into a production environment can be one of the most difficult parts of a data science solution. There are a lot of nuances in the current state of most organizations’ IT infrastructure, including on-prem and cloud-based data sources (or maybe even some Excel files), resource availability to run the model at a given time, and scheduling the job so it fits into all of the IT processes. A model that is not successfully integrated into a company’s ecosystem fails as a solution.

Great, so investing in data science is not going to work? Well, not exactly!

How to Successfully Implement a Data Science Project

Most failed data science projects could have been stopped early, before they turned into large-scale failures. It is true that data science will not always be the solution to a particular problem, but creating a process to quickly test and scale has helped many organizations avoid large flops. The key? Start small!

Proof of Concept

The proof of concept (PoC) phase of a project is a “back of the napkin” solution. The goal of this phase is to understand the availability and quality of the data that is expected to be used and whether data science is the right approach for the problem at hand. In this phase, there are several business discovery sessions with key stakeholders, lots of data investigation, and a small-scale model. The modeling focus in a PoC is usually a single product or region. PoCs vary in length depending on the goals and stakeholder availability, but they typically take anywhere from 2-6 weeks. If the results of this phase are not promising, it is easy to stop the engagement and shift focus elsewhere. If the results are good, then it is a solid indicator that there could be more to gain from this project.

Prototype

The next step in the process is a prototype. This phase takes the data findings, business acumen, and basic modeling a step further. Usually, more products or locations are included in the scope of the analysis, and additional model refinement techniques are tried. Prototypes are where the business tends to get even more involved in the process by conducting model reviews. Does the model make sense? Is it adding insight into the problem at hand? The model most likely will not be perfect in this phase, but it should still give solid results. In this phase, the data scientist will also try creating new variables from current and new data sources to see if they have an impact. The length can vary, but it is usually in the ballpark of 8-12 weeks. Again, if the results are not promising, it is relatively easy to stop the engagement at this point before continuing to invest.

Minimum Viable Product

If it makes sense to continue, the next step is to scale the model out to a minimum viable product (MVP). In this phase, we’re not looking at a niche subset of data, but at a larger portion of the portfolio. Now, the model could be running on North America as a region or maybe across the top categories in the portfolio. This phase continues to find ways to improve the model and adjust it as it scales out in the portfolio. Business reviews are also critical in this phase so the data scientist can understand nuances in the business behavior and how to adjust appropriately. At this phase, there is a pretty good indication that there is business value in the current state of the solution. At the end of this phase, the solution is equivalent to a house that is built, but does not have plumbing or electricity. It can work as is and provide information, but it cannot be counted on reliably or be run automatically. MVPs again vary in time, but they are usually around 12-18 weeks. Again, if the results are not satisfactory, then the solution can be stopped before allocating larger IT resources on a recurring basis.
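
To see how much time is at stake at each stop point, the phase lengths quoted so far can simply be added up. The sketch below assumes the phases run back to back with no gaps between them, which the ranges above do not guarantee; the function name is made up for this example.

```python
# Phase lengths in weeks as (name, low, high), taken from the ranges in the text.
phases = [
    ("PoC", 2, 6),
    ("Prototype", 8, 12),
    ("MVP", 12, 18),
]


def cumulative_weeks(phases):
    """Return (name, total_low, total_high) after each phase completes."""
    total_low = total_high = 0
    result = []
    for name, low, high in phases:
        total_low += low
        total_high += high
        result.append((name, total_low, total_high))
    return result


for name, low, high in cumulative_weeks(phases):
    print(f"Stop point after {name}: {low}-{high} weeks invested")
```

Under that assumption, walking away after the PoC costs 2-6 weeks, after the prototype 10-18 weeks, and after the MVP 22-36 weeks, which is why the earliest stop points are the cheapest places to fail.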

Production

If a solution reaches this point, there is usually a very good indication that it is a worthwhile investment. The production phase of the process scales the model to the entire portfolio while integrating it into the live IT ecosystem. In this phase, it is crucial to understand all of the upstream and downstream dependencies of the model. Certain data sources may have to be refreshed each week before the model can run, and the run itself may have to be scheduled at a particular time so it does not interfere with other models or data refreshes. At a small company, this may be straightforward. But for medium to large organizations, the coordination in this phase is extremely difficult. Putting a model into production can take 12+ weeks, and the model will then need recurring monitoring so that it continues to run and provide satisfactory results.
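
One way to reason about upstream and downstream dependencies is to write them down as a graph and let a topological sort produce a valid run order. The sketch below uses Python's standard graphlib module (Python 3.9+); the job names are hypothetical, and a real scheduler would also need to handle timing and resource constraints, which this example ignores.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs. Each key maps to the set of jobs that must finish first.
dependencies = {
    "refresh_sales_data": set(),
    "refresh_inventory_data": set(),
    "run_forecast_model": {"refresh_sales_data", "refresh_inventory_data"},
    "publish_dashboard": {"run_forecast_model"},
}


def run_order(deps):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
```

In any order this returns, both data refreshes come before the model run, and the dashboard is published last. If two jobs depend on each other, the sort raises an error, which surfaces the conflict before anything is scheduled.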

Conclusion

The scaled approach described above helps indicate whether there is a risk in terms of data availability, business buy-in, and technology support. It will become evident in the PoC and MVP phases whether or not there is a huge risk to scaling out the project. Sometimes data science will not be the right approach, and that can be frustrating for both the data scientists and business stakeholders, but it is better for everyone to determine that outcome in the PoC phase rather than trying to build and productionize something from day one. Data science solutions are complex and take time to build for great results. When these solutions are built correctly, they can offer insights that can disrupt industries!
