July 30, 2018

Data Science Enablement: The First of Its Kind

By Jordan Birdsell


Here at phData, we are proud to announce our newest service offering in the Big Data space: Data Science Enablement. phData has driven success in enterprise Big Data initiatives for years, and we’re excited to now complete the circle of services by enabling organizations to maximize the return on the investments they’ve made in Data Science and Advanced Analytics. By combining our existing world-renowned expertise in Cloudera platform management and application development with a series of strategic hires in the Data Science field, phData is uniquely equipped to provide the first and only Data Science Enablement consulting service in the world. Because many organizations have had difficulty pinpointing the issues affecting their Data Science teams, we’ve taken the time to provide insight into both the problems and the solutions.

The Data Science Life Cycle

The data science life cycle, as we see it, is effectively broken into four phases of work. While some organizations may not divide the work precisely as we’ve defined it below, you can quickly see where your work aligns to these four phases.

[Figure: Data Science Life Cycle]

Discovery – The discovery phase generally determines the overall viability of the project. A team will identify, gain access to, and ingest any data necessary for the project. Often, some preliminary analysis is also done in this phase of work.

Model Training – During model training, a data scientist will go through the work necessary to train/fit the model to address the business problem.

Deployment – Deployment, as you may imagine, is a phase dedicated to deploying a model and delivering results to the business. 

Monitoring – This phase is intended to monitor the performance of the model(s) and to enable a feedback loop to the Discovery/Training phases when a model is not performing as intended. 

To achieve success on a Data Science project, a team must deliver on all of these phases of work; a failure to do so can result in minimal returns or, worse, harm to existing business. This is important enough to repeat: failure to invest specifically in model monitoring can have a large negative impact on existing business.
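To make the Monitoring feedback loop concrete, here is a minimal Python sketch of one way it could work. It is an illustration under stated assumptions, not phData's implementation: the `ModelMonitor` class, the accuracy metric, the tolerance, and the window size are all hypothetical choices. The idea is to compare a deployed model's recent accuracy against the accuracy measured at training time, and to raise a flag when it is time to send the model back to the Discovery/Training phases.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ModelMonitor:
    """Hypothetical helper that tracks rolling accuracy for a deployed model."""

    baseline_accuracy: float   # accuracy measured at the end of Model Training
    tolerance: float = 0.05    # allowed drop before flagging (assumed value)
    window_size: int = 100     # number of recent predictions to evaluate (assumed value)
    outcomes: deque = field(init=False)

    def __post_init__(self):
        # Only the most recent `window_size` outcomes are kept.
        self.outcomes = deque(maxlen=self.window_size)

    def record(self, predicted, actual):
        """Store whether a single prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def rolling_accuracy(self):
        """Return accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """Flag the model once a full window falls below baseline minus tolerance."""
        if len(self.outcomes) < self.window_size:
            return False
        return self.rolling_accuracy() < self.baseline_accuracy - self.tolerance


# Usage: a tiny window of four predictions, two of which were wrong.
monitor = ModelMonitor(baseline_accuracy=0.90, tolerance=0.05, window_size=4)
for predicted, actual in [(1, 1), (0, 1), (1, 0), (0, 0)]:
    monitor.record(predicted, actual)

print(monitor.rolling_accuracy())
print(monitor.needs_retraining())
```

In practice the metric, the threshold, and the alerting channel would be agreed between the data scientists and the operations team; a single accuracy window is only the simplest possible starting point.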

In a World Without phData

In a world without phData, many organizations have struggled or outright failed to obtain a return on their investments in Data Science and Advanced Analytics. In some cases, the enterprise has gone through great trials and tribulations simply to get its analytics initiatives off the ground, only to find that it can’t provide its Data Scientists with what they need. Those that have managed to make it past the initial phases of data discovery and model training have found it virtually impossible to deploy the developed models into production. Lastly, in the few cases where companies have been able to deploy a model in some form, they have fallen short of the appropriate level of monitoring and later discovered that the trained model is no longer performing as intended and is causing significant loss to the business.

Based on our experience, almost all of these issues are the result of underinvestment in architecture, engineering, and operations work. We’ve seen many enterprise Data Science organizations turn to their existing data and analytics support teams, only to find that those teams are unfamiliar with the concept of predictive analytics and even less experienced with the tools and technologies required to run a successful Data Science team. As a result of this lack of assistance, Data Scientists and Statisticians are left holding the bag for much of the architecture, engineering, and operations work involved in the project life cycle.

In a World with phData

In a world with phData, enterprises are discovering that with the proper level of investment in architecture, engineering, and operations they can achieve huge returns on their existing data science investments. phData has helped our clients significantly reduce the amount of time spent on project onboarding and data ingestion.

Through the use of Heimdali, our customers have been able to enable self-service project onboarding, getting projects off the ground more quickly and reducing costs in the process. In data engineering engagements, we use Pipewrench, an open source ingestion framework developed by phData while working with a client, to automate the creation of data ingestion pipelines in StreamSets and Sqoop, shortening the time it takes to get Data Scientists the data they so desperately need.

We have provided general project and technical consulting to Data Science teams through our Data Science Center of Excellence, enabling teams to more effectively use their Cloudera platform and the tooling along with it, such as Cloudera’s Data Science Workbench.

[Figure: Data Science Stack Diagram]

Perhaps most importantly, we have made strategic investments in the expansion of our Machine Learning Engineering team, focusing on talent with a background in both the Data Science and Engineering disciplines. Our machine learning engineers have delivered enterprise-grade production solutions for a wide range of disparate models.

Lastly, in the deployment process, our machine learning engineers work with data scientists to define and develop appropriate model metrics that will be monitored on Pulse, an open source application monitoring framework developed by phData. Then, our operations team provides 24×7 intelligent support using both the monitoring and alerting capabilities of Pulse.


My belief is that the enterprise should continue to focus on growing its competitive advantage and intellectual property through the development of actionable Data Science models. phData has shown customers the light at the end of the tunnel in some of their darkest and most challenging early days of Big Data, and we are very excited to now provide that same hope and experience to customers struggling to see a way out of the daunting challenges of Enterprise Data Science. I’ve seen this problem plague countless organizations, but I have also seen what success can look like, and I am passionate about providing solutions to it. I look forward to speaking with you in the future and welcome you to reach out!
