June 18, 2019

The Ultimate Guide to Building a Machine Learning Solution

By Jordan Birdsell

Did you know that more than 90% of the world’s data was generated in the last 2 years? That’s why machine learning models that find patterns in data and can make decisions are critical for businesses today. Organizations everywhere, across all industries, are fast approaching the largest disruption to hit the business world since the Industrial Revolution, and sadly, most are unprepared for the road ahead. In this blog post, we will provide you with a simple, yet practical, pattern that can be used to deploy value-generating AI and machine learning solutions to solve virtually any business problem.

You will walk away from this machine learning guide with:

  • Confidence that machine learning solutions can solve real business problems
  • Understanding that machine learning solutions aren’t magic
  • Knowledge that these solutions can be designed in a simple, repeatable way

What is Machine Learning?

While it is not our goal to teach you machine learning, if you are new or somewhat unclear on the topic, hopefully this introduction will make the remainder of this post easier to digest. You may find it useful to think of a machine learning model as a method or function in a traditional application that, when provided with valid input, will return a result. The main distinction between conventional and machine learning functions is how they are composed. In a traditional Java application, for example, a software engineer must come up with explicit instructions that a machine will execute in order to produce the desired outcome. With machine learning, on the other hand, we provide examples of our desired outcome to a computer and let it determine the instructions on its own. This groundbreaking new programming paradigm allows us to tackle challenges that had previously been very difficult or impossible for software engineers to solve.
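
To make the distinction concrete, here is a minimal sketch in Python. The feature (a scratch length in millimeters), the sample values, and the function names are all invented for illustration; a real model would learn from many features rather than a single threshold.

```python
# Hypothetical illustration: the feature, values, and names are assumptions.

def is_defect_handwritten(scratch_mm: float) -> bool:
    """Traditional approach: an engineer hard-codes the rule."""
    return scratch_mm > 2.0


def learn_threshold(examples: list[tuple[float, bool]]) -> float:
    """Learning approach: pick the threshold that best separates labeled examples."""
    candidates = sorted(value for value, _ in examples)
    best_threshold, best_correct = candidates[0], -1
    for threshold in candidates:
        # Count how many examples this candidate rule would classify correctly.
        correct = sum((value >= threshold) == label for value, label in examples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


# Labeled examples of the desired outcome: (scratch length in mm, is it a defect?)
examples = [(0.4, False), (0.9, False), (1.3, False), (2.8, True), (3.5, True)]
threshold = learn_threshold(examples)


def is_defect_learned(scratch_mm: float) -> bool:
    """Same call shape as the handwritten function, but the rule came from data."""
    return scratch_mm >= threshold
```

Both functions are called the same way; the only difference is whether a person or the data supplied the rule.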

The Project Lifecycle for Machine Learning

We will break our work into 4 easy steps to mirror how a project should be executed in a real business setting. We use these lifecycle phases to help us break down our problems into logical steps for developing and maintaining our machine learning solution.

Before we proceed, it’s important to understand that, while we will be focused on a single use case in this writing, the architecture we will use for this solution is applicable to virtually any business problem. We’ve listed just a few examples below:

  • Healthcare: predicting medical issues from patient data while they are under care
  • Customer service: routing phone calls based on anticipated customer needs
[Figure: Machine Learning Process]

Now, let’s take a closer look at our use case and the tools we intend to use to implement our machine learning solution.

Defect Detection in Manufacturing

Imagine for a moment that you are a Production Line Executive for a manufacturing plant. You are responsible for meeting your production goals in a timely and cost-effective way, all while maintaining high-quality standards. One of your biggest challenges is maintaining product quality, and as the number of production defects increases, costs rise and time is lost.  

We will demonstrate how to use machine learning with a simple set of tools to implement a manufacturing defect detection application, helping you, the Production Executive, to meet your goals on time and to reduce any liability associated with defective products, such as lawsuits resulting from faulty medical devices.

With our machine learning model, we are focusing on classifying defects in steel plates. Our data for this example project comes from research conducted by Semeion, Research Center of Sciences of Communication. Remember, however, that our focus is on the capabilities delivered by this solution’s architecture and not on the sample application.

Tools and Software

While we could write an entire blog post on the pros and cons of different machine learning platforms and tools, we’ve selected a stack that has proven itself and meets the rigorous standards of enterprise organizations across the globe. The tools we have selected for our machine learning solution are largely centered around Cloudera, an enterprise data cloud platform with products and solutions covering everything from the Edge to AI.  

Below, you will find a list of the tools we will use and links that you can follow to get more information:



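  • Apache MiNiFi/NiFi: edge data collection and data flow processing
  • Apache Kafka: streaming platform
  • Apache Kudu: data storage
  • Cloudera Data Science Workbench (CDSW): data science
  • Arcadia Data: dashboarding
  • phData's Pulse: model monitoring and alerting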


Step 1: Ingestion

The most crucial input to any machine learning project is, you guessed it, data. We will begin this project by ingesting our data so that we can conduct our analysis and train a model. It’s worth noting that we intend to ingest our data in a streaming manner as our solution will require near real-time action and we find it best to not add latency unless it is necessary.

Now, let’s take a look at the different steps involved in ingesting our data from the edge, the manufacturing line in this case, and then we will show how the tools we mentioned earlier align with those steps.

  1. To begin, we will need to collect data from the edge. In our sample project, the edge is a manufacturing plant and an edge device would be a machine that gathers data from the manufacturing line. Note that the data we collect in this use case describes the configuration of the manufacturing line as well as some details about the items produced.
  2. Once we’ve collected all of the applicable data points on our edge device, we need to push them to a central data platform for processing and storage. Later, when we discuss our deployment strategy, you will see a similar communication pattern is required. Upon arrival in our data platform, our data needs to be cleansed and enriched before being queued for storage.  
  3. It is important that we buffer our data in a streaming architecture so that we can decouple the different steps in our stream, preventing processes from being blocked.  
  4. Finally, with our data transformed and put into a buffer, we can flush our data out to its final resting place, on disk.
[Figure: Machine Learning Ingestion]
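
The four steps above can be sketched with nothing but the Python standard library. This is only an illustration of the pattern, not the pipeline itself: an in-memory queue stands in for the streaming buffer, a list stands in for the storage layer, and the readings and field names are made up.

```python
import queue
import threading

# Hypothetical sensor readings from the edge; field names are assumptions.
RAW_READINGS = [
    {"plate_id": 1, "thickness": " 40 "},
    {"plate_id": 2, "thickness": "n/a"},   # malformed reading, dropped during cleansing
    {"plate_id": 3, "thickness": "70"},
]

buffer = queue.Queue()   # stands in for the streaming buffer
stored_rows = []         # stands in for the storage layer
_DONE = object()         # sentinel that tells the consumer to stop


def cleanse_and_enrich(reading):
    """Step 2: validate the reading and add a derived field."""
    try:
        thickness = int(reading["thickness"].strip())
    except ValueError:
        return None
    return {"plate_id": reading["plate_id"], "thickness": thickness,
            "is_thick": thickness >= 50}


def produce():
    """Steps 1-3: collect, transform, and place each record in the buffer."""
    for reading in RAW_READINGS:
        record = cleanse_and_enrich(reading)
        if record is not None:
            buffer.put(record)
    buffer.put(_DONE)


def consume():
    """Step 4: drain the buffer and persist each record."""
    while True:
        record = buffer.get()
        if record is _DONE:
            break
        stored_rows.append(record)


producer = threading.Thread(target=produce)
consumer = threading.Thread(target=consume)
consumer.start()
producer.start()
producer.join()
consumer.join()
```

Because the producer and consumer only share the buffer, either side can slow down or restart without blocking the other, which is the decoupling described in step 3.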

We will now examine how we can apply some of the tools mentioned earlier to create an ingestion pipeline:

  1. We will start by looking at data collection on the edge. Here, we have chosen to use Apache MiNiFi, a lightweight and portable agent developed with this exact solution in mind.  
  2. Now, let’s look at our need to centralize and transform our data. In our project, we have selected Apache NiFi to address these needs. NiFi is an easy-to-use, powerful, and reliable system for processing and distributing data at scale. As you may have guessed, MiNiFi is an extension of the NiFi product, so there is no concern about integrating the two.
  3. If you recall, it is essential that we buffer our data between steps in a streaming application. Apache Kafka is a scalable, fault-tolerant streaming platform that is widely used to enable real-time streaming data pipelines, and we intend to use it as our buffering mechanism.  
  4. As we approach the end of our ingestion pipeline, we must select a storage layer that can persist fast-changing data while supporting the analytical use cases we build on top of it. Our requirements naturally lend themselves to the use of Apache Kudu, a distributed, relational data store that was created to address use cases just like ours.
[Figure: MiNiFi, NiFi, Kafka, and Apache Kudu]

We’re now ready to implement this pipeline. With data now streaming into our data store from the manufacturing line, it is time that we start our analysis. Let us take a closer look at what it takes to train a machine learning model with our newly ingested data so that we can predict manufacturing defects.

Step 2: Training

Our next objective is to build a machine learning model that can properly classify defects for steel plates that move across the manufacturing line. Now, let us look at the steps we must take to train a model and the tools that will help us do that.

  1. Our first task in training our model is to become more familiar with the data we intend to train our model against. Broadly classed as exploratory data analysis, we must make sure we have an adequate understanding of our data before we attempt to use it.
  2. Since we are streaming our data into our store, training a model on our current dataset would be a bit like trying to hit a moving target. To avoid this problem, we will need to take samples from our dataset that we can use to train and test our model.
  3. Equipped with the knowledge we gained in our exploratory data analysis and our sampled data, we are now ready to begin training our model. During the training process, we will run a number of different experiments, tuning parameters and tracking outcomes on each run to help us find the best model. With each of these experiments, we evaluate the model produced by running it against our test data to ensure that it will perform as expected.  
  4. Once we have concluded our experimentation, we can select the trained model that performs the best.
[Figure: Machine Learning Training Model Process]
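
As a rough sketch of steps 2 through 4, the example below freezes a sample, splits it into training and test sets, runs one experiment per parameter value, and keeps the best result. Everything in it is an assumption made for illustration: the data is synthetic, there is only one feature, and the model is a tiny nearest-neighbor classifier rather than whatever algorithm a real project would choose.

```python
import random

rng = random.Random(42)  # fixed seed so every experiment sees the same sample


def make_row():
    """Generate one synthetic row: (feature value, is it a defect?)."""
    defect = rng.random() < 0.5
    center = 3.0 if defect else 1.0
    return (rng.gauss(center, 0.8), defect)


# Step 2: freeze a sample of the (continuously growing) dataset and split it.
snapshot = [make_row() for _ in range(200)]
rng.shuffle(snapshot)
train, test = snapshot[:150], snapshot[150:]


def knn_predict(train_rows, x, k):
    """Classify x by majority vote among its k nearest training rows."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(1 for _, label in nearest if label)
    return votes * 2 > k


def accuracy(train_rows, test_rows, k):
    """Fraction of test rows that the k-nearest-neighbor model labels correctly."""
    hits = sum(knn_predict(train_rows, x, k) == label for x, label in test_rows)
    return hits / len(test_rows)


# Step 3: run one experiment per parameter value and track the outcome of each.
experiments = {k: accuracy(train, test, k) for k in (1, 3, 5, 9, 15)}

# Step 4: select the best-performing model (here, the best value of k).
best_k = max(experiments, key=experiments.get)
```

In practice we would usually hold back a separate validation set for choosing between experiments, so that the final test score is not biased by the selection itself.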

Let’s take a look at the tools we will use to explore our data and to train our model. We will primarily be talking about Apache Impala and Cloudera Data Science Workbench (CDSW).

  • Impala is a massively parallel processing SQL query engine that we use to complement our storage layer, Kudu, because Kudu doesn’t provide its own query engine.
  • CDSW is a secure, self-service Data Science platform built with the enterprise client in mind, which is an important point to make when most data science tools seem as if they were designed for the hobbyist working out of their basement.

With both Impala and CDSW in our employ, we can explore our data and conduct our analysis in a secure, reproducible way that can scale to handle many projects. Here, Impala provides us with a query engine that allows us to aggregate, summarize, and manipulate our data, while CDSW gives us a platform for executing and organizing our work. As we proceed to our model training and experimentation, we will continue to use CDSW. As a general-purpose development environment, CDSW allows us to develop our models in Python, R, or Scala, using virtually any library that we desire. The workbench additionally provides a feature, aptly named experiments, which offers a containerized approach for running our trials, along with mechanisms for dashboarding and tracking our metrics and models.

[Figure: Kudu and CDSW logos]

Now that we have a model trained and selected, the next step is to integrate that model into our manufacturing process so that we can begin identifying defective steel plates and driving real business value.

Step 3: Deployment

As we proceed to the deployment phase of our project, let’s consider the needs of our use case and how we might best integrate our machine learning model into our business process. While there are many different approaches that can be used to deliver the results of a model, we will select a strategy that aligns best with our requirements and enables the reusability of our architecture. For our use case, we are looking to drive two different outcomes:

  • Near real-time defect classification for each steel plate
  • Ability to detect trends in defects over a specified period of time

In order to accomplish our first requirement, we must integrate the trained model directly with the manufacturing line. There are two approaches that can be used here:

  1. Our first option is to physically deploy our model out to the edge device. This approach carries with it the benefit of allowing our edge device to go offline and make predictions independently of any centralized infrastructure. However, there are a number of consequences to this approach that make it less desirable, such as the difficulty of managing the application and its dependencies on the edge.  
  2. Alternatively, we have the option of deploying our model centrally and exposing it over the network to our edge device, via REST or gRPC. Provided that we are not concerned with network disruptions on the edge, this is our best option, as it allows us to centrally manage and control our model.
[Figure: Machine Learning Model Deployment]
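
To illustrate the second option, here is a small sketch of a centrally hosted model exposed over HTTP and called from the edge, using only the Python standard library. It is not how any particular product implements this: the `predict` rule, the `scratch_mm` field, and the `/predict` path are placeholders we made up, and a real deployment would add authentication, input validation, and error handling.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features: dict) -> dict:
    """Placeholder for the trained model; the rule and field name are invented."""
    return {"defect": features["scratch_mm"] >= 2.0}


class ModelHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON payload, run the model, and send the prediction back.
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))
        body = json.dumps(predict(features)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def classify_from_edge(url: str, features: dict) -> dict:
    """What the edge device does: send one plate's features, read the prediction."""
    request = urllib.request.Request(
        url,
        data=json.dumps(features).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Bypass any system proxy, since this demo only talks to localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())


# Central side: serve the model on a free local port in a background thread.
server = HTTPServer(("127.0.0.1", 0), ModelHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/predict"

result = classify_from_edge(url, {"scratch_mm": 3.1})
server.shutdown()
server.server_close()
```

Because the model lives behind a single endpoint, replacing it with a retrained version requires no change on the edge device, which is the main appeal of central deployment.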

Having selected an architectural approach for integrating our machine learning model with the edge, we now must select a tool or series of tools that can help us accomplish this. At this point, a reasonable person may begin to be overwhelmed by the idea of adding additional tools to our solution’s architecture. The folks at Cloudera, however, were mindful of such fears and have enabled us to implement such a solution with the tools we are already using in our project.  

  • CDSW, for example, has a feature called models, which allows us, with just a couple of clicks, to expose our model as a REST API.
  • Additionally, the MiNiFi agent, which we are already running on our edge device, is capable of communicating with our exposed CDSW API, allowing us to make real-time predictions on the edge.  
  • Finally, we are able to place these predictions into our buffering tool, Apache Kafka, to notify other applications on the edge of defects, so that the proper action can be taken.
[Figure: MiNiFi to CDSW to Kafka]

You will find that our second requirement, that we are able to detect trends in production defects over time, is simpler to solve. In fact, we will effectively build this solution in the monitoring phase of our project, described in the next section. Monitoring is absolutely critical to successful machine learning solutions.

Step 4: Monitoring

As we have already mentioned, the monitoring of our machine learning model is critical to the success of our project. In the same way that we are now monitoring our manufacturing process and identifying defects in the products, we must also keep an eye on the performance of our model to identify flaws in its predictions. Tracking statistical performance over time, looking for decay, is so important because data is constantly changing and can cause our model to behave in unpredictable ways. This is called concept drift. Imagine for a moment that, at some point down the road, we change the schematics for our steel plates and begin to crimp the edges. If we implement this change in design without making appropriate changes to our solution, we are at risk of misclassifying our new crimped edges as scratches. Oftentimes, however, the changes in our data that affect our model’s performance are less obvious and harder to anticipate and, if left unchecked, can cause significant financial or regulatory harm.

How can we implement an effective model monitoring solution to avoid the devastating consequences of concept drift?  

  1. First, if our business process can include human validation, then we should always take advantage of this. For our purposes, we already employ a number of Quality Technicians on our manufacturing line to validate product quality. So, we shall use a subset of the current staff to continue to check quality at random, providing us with an essential source of truth to help gauge our model’s performance.
  2. Next, based on our use case, we should define a list of metrics and KPIs that can be tracked to help us understand if we’re experiencing drift. These metrics should help us understand both the desired business outcome, such as a reduced scrap rate, and the distribution of our data, such as the standard deviation of our plates’ area.

Once all of the appropriate data is being collected, we can establish dashboards and alerts to allow operators to efficiently monitor our solution.
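
The two kinds of checks described above can be sketched in a few lines of Python. This is a simplified illustration, not the API of any monitoring product: the class name, thresholds, window size, defect labels, and sample values are all assumptions chosen for the example.

```python
import statistics
from collections import deque

# All thresholds and values below are illustrative assumptions.
BASELINE_AREA_STDEV = 12.0   # stdev of plate area observed at training time
MIN_ACCURACY = 0.90          # alert if agreement with technicians drops below this
MAX_STDEV_RATIO = 1.5        # alert if the spread grows 50% beyond the baseline


class DriftMonitor:
    def __init__(self, window: int = 100):
        self.checks = deque(maxlen=window)   # True when model agreed with technician
        self.areas = deque(maxlen=window)    # recent plate areas seen by the model

    def record_prediction(self, plate_area: float) -> None:
        self.areas.append(plate_area)

    def record_human_check(self, predicted: str, actual: str) -> None:
        self.checks.append(predicted == actual)

    def alerts(self) -> list[str]:
        """Return a message for every metric that is outside its allowed range."""
        found = []
        if self.checks:
            accuracy = sum(self.checks) / len(self.checks)
            if accuracy < MIN_ACCURACY:
                found.append(f"accuracy {accuracy:.2f} below {MIN_ACCURACY:.2f}")
        if len(self.areas) >= 2:
            ratio = statistics.stdev(self.areas) / BASELINE_AREA_STDEV
            if ratio > MAX_STDEV_RATIO:
                found.append(f"plate area stdev is {ratio:.1f}x the baseline")
        return found


monitor = DriftMonitor()
for area in (100.0, 160.0, 95.0, 170.0):       # much wider spread than the baseline
    monitor.record_prediction(area)
for predicted, actual in [("scratch", "scratch"), ("scratch", "crimp"),
                          ("bump", "bump"), ("scratch", "crimp")]:
    monitor.record_human_check(predicted, actual)

current_alerts = monitor.alerts()
```

The first check relies on the technicians' random quality checks as the source of truth, while the second needs no labels at all, so it can flag a shift in the incoming data before any accuracy figures are available.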

[Figure: Machine Learning Modeling Process]

With this strategy in mind, let’s examine the tools we will need to implement our monitoring solution and complete our project. The most critical component in this last phase will be phData’s Pulse, a distributed framework for model monitoring and alerting built on top of the Cloudera platform. With the addition of just a couple of lines of code in our deployed model, we are able to begin centrally tracking our metrics in Pulse. As for the quality feedback we are set to receive from our technicians, we will host a second API inside of CDSW that accepts this feedback and relays it to Pulse for tracking. With all of our monitoring data now being collected and tracked, we can set up alerts and build our dashboards. For dashboarding, we have a number of different options; however, we have chosen Arcadia Data because of its simplicity and native integration with the rest of our technical stack.

[Figure: Machine Learning Monitoring with phData Pulse]

What will we do once the model has degraded to a point that is unacceptable? When this occurs, we repeat the steps of our project, usually starting with model training, as we may not need to ingest any additional data. In fact, a nice side effect of continuously collecting human-labeled data in our monitoring solution is that we can use that data to refit our model when the time comes.

Bringing It All Together

Congratulations, just like that, you have successfully implemented a robust, value-generating machine learning solution that solves a real-world business problem. Beyond that, you have learned a pattern that can be used across any number of business problems you will face along your journey. At phData, our mission is to accelerate you and your team through this process, to help you begin generating real value in no time. We hope you found this blog post helpful and we encourage you to reach out to us with any questions you have. We would love to hear from you.
