Introduction
The Data Science Life Cycle
Discovery – The discovery phase generally determines the overall viability of the project. A team will identify, gain access to, and ingest any data necessary for the project. Often, some preliminary analysis is also done in this phase.
Model Training – During model training, a data scientist will go through the work necessary to train/fit the model to address the business problem.
Deployment – Deployment, as you may imagine, is a phase dedicated to deploying a model and delivering results to the business.
Monitoring – This phase is intended to monitor the performance of the model(s) and to enable a feedback loop to the Discovery/Training phases when a model is not performing as intended.
To succeed on a Data Science project, a team must deliver on all of these phases of work; failing to do so can result in minimal returns or, worse, harm to existing business. This point is important enough to repeat: failure to invest specifically in model monitoring can have a large negative impact on existing business.
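To make the monitoring feedback loop concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the class name, the choice of accuracy as the metric, the window size, and the threshold are all hypothetical, and it does not represent the API of any particular monitoring tool. The idea is simply to track a model metric over recent predictions and flag the model for a return to the Discovery/Training phases when that metric drops too low.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Hypothetical helper: tracks accuracy over the most recent predictions."""

    def __init__(self, window_size=100, min_accuracy=0.8):
        # Assumed values; a real project would pick these per model.
        self.outcomes = deque(maxlen=window_size)
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        """Store whether a single prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Return accuracy over the current window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """Flag the model only once the window is full and accuracy is low."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.accuracy() < self.min_accuracy
```

In use, each scored record would be passed to record() once its true outcome is known, and needs_retraining() would be checked on a schedule to decide whether to alert the team.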
In a World Without phData
In a world without phData, many organizations have struggled or outright failed to obtain a return on their investments in Data Science and Advanced Analytics. In some cases, the enterprise has gone through great trials and tribulations simply to get its analytics initiatives off the ground, only to find that it can’t provide its Data Scientists with what they need. Those that have managed to make it past the initial phases of data discovery and model training have found it virtually impossible to deploy the developed models into production. Lastly, in the few cases where companies have been able to deploy a model in some form, they have fallen short of the appropriate level of monitoring and later discovered that the trained model was no longer performing and was causing significant loss to the business.
Based on our experience, almost all of these issues result from underinvestment in architecture, engineering, and operations work. We’ve seen many enterprise Data Science organizations turn to their existing data and analytics support teams, only to find that those teams are unfamiliar with the concept of predictive analytics and even less experienced with the tools and technologies required to run a successful Data Science team. As a result of this lack of assistance, Data Scientists and Statisticians are left holding the bag for much of the architecture, engineering, and operations work involved in the project life cycle.
In a World with phData
In a world with phData, enterprises are discovering that with the proper level of investment in architecture, engineering, and operations, they can achieve huge returns on their existing data science investments. phData has helped our clients significantly reduce the amount of time spent on project onboarding and data ingestion.
Through the use of Heimdali, our customers have been able to enable self-service project onboarding, getting projects off the ground more quickly and reducing costs in the process. In data engineering engagements, we use Pipewrench, an open source ingestion framework developed by phData while working with a client, to automate the creation of data ingestion pipelines in StreamSets and Sqoop, which speeds up delivery of the data that Data Scientists so desperately need.
We have provided general project and technical consulting to Data Science teams through our Data Science Center of Excellence, enabling teams to more effectively use their Cloudera platform and the tooling that comes with it, such as Cloudera’s Data Science Workbench.
Perhaps most importantly, we have made strategic investments in expanding our Machine Learning Engineering team, focusing on talent with a background in both Data Science and Engineering disciplines. Our machine learning engineers have delivered enterprise-grade production solutions for a wide range of disparate models.
Lastly, in the deployment process, our machine learning engineers work with data scientists to define and develop appropriate model metrics that will be monitored on Pulse, an open source application monitoring framework developed by phData. Then, our operations team provides 24×7 intelligent support using both the monitoring and alerting capabilities of Pulse.