February 18, 2022

How to Estimate ROI for AI and ML Projects

By Charlie Isaksson

Like any other type of investment, Machine Learning (ML) and Artificial Intelligence (AI) projects come with risks and returns. One of the driving forces behind making smart investment decisions is estimating expected returns before deploying such technologies. 

Return on Investment (ROI) is a financial ratio of an investment’s gain or loss relative to its cost. In its simplest form, when you invest in AI, the benefits should outweigh the costs. But that is not always the case. In 2006, Netflix launched the Netflix Prize, a machine learning competition that offered $1 million to the team that could improve the accuracy of its recommendation engine by 10 percent.

A team reached that target in 2009 and won the prize. However, Netflix decided not to deploy the winning algorithm because of the high engineering effort needed to bring it into the production environment.
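As a minimal sketch of the ROI ratio defined above, the helper below computes the net gain relative to cost. The function name and the dollar figures are hypothetical and only illustrate the arithmetic.

```python
def roi(gain: float, cost: float) -> float:
    """Return on investment as a fraction of cost: (gain - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (gain - cost) / cost


# Hypothetical figures, for illustration only.
print(f"{roi(gain=150_000, cost=100_000):.0%}")  # benefits outweigh costs
print(f"{roi(gain=80_000, cost=100_000):.0%}")   # costs outweigh benefits
```

A negative value means the project returned less than it cost, which is the situation the Netflix example warns about.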

Doing analytics right starts by investing time to ask the right questions that will lead to a better outcome. Implementing any tech-related initiative requires organizations to answer many questions: 

  • What business problems do we want/need to solve?
  • Are these business goals realistic?
  • What is the impact of deploying the proposed solution?
  • Can we determine the optimal use of IT resources for solving this problem?
  • What data is necessary for the problem we want to solve?
  • Is the data available?
  • Have we engaged with the stakeholders that should influence the project? 
  • Do we have the funding for this project?
  • What is the predicted payback period?

In this post, we will examine a simple data science (DS) use case and break down a standard formula to compute ROI. The intent is to help decision-makers maximize their ROI by reducing uncertainty!   

Data Science Life-Cycle

It is essential to define a few terms in the data science life-cycle in order to understand its iterative nature. The figure below shows a typical data science workflow.

At its core, data science is an interdisciplinary field that overlaps with multiple domains, e.g., math and statistics, machine learning, software engineering, visualizations, databases and data processing. The life-cycle is the process that moves across these domains in stages such as:

  • Business Understanding: Working with business partners to identify company needs, assess current and future states, and determine the data science goals and project plans.
  • Data Exploration and Preparation: Identifying the data source and assessing the data quality, data governance, tooling, and infrastructure; selecting features, creating new features, and handling missing values; and finally, conveying the insights to the business through data visualizations and data profiling.
  • Modeling: Selecting the modeling techniques (create a baseline model), generating test design, building models, and model assessment.
  • Evaluation: Checking common risks, evaluating results, reviewing processes, and determining the next steps.
  • Deployment: Planning the deployment, monitoring, and performance; producing the final report; and reviewing the project with the business.
A visualization of an atom that has "data science" written in the center followed by a number of orbiting data science subfields like "visualizations" and "machine learning."

The data science workflow is a highly iterative process. At any time, things can change, making the scoping more challenging. The work is repeated or augmented until a clear set of insights is available and deemed sufficient by the project stakeholders. One aspect is very clear from the above figure: data exploration and preparation are the foundation of the life-cycle.

When data is analyzed properly, models reach higher performance much faster and the reward is clear. Conversely, failing to spot the risks in the dataset early can be very costly.

How to Reduce the Risk of Investment in AI/ML

There are several ways to reduce the risk from AI-based applications.  

Design AI Development Methodologies 

Designing an AI development methodology starts with the initial scoping of the project. Whether you use agile, waterfall, or some hybrid for project and risk management, planning is best done together with the business stakeholders. This planning stage is the greatest opportunity to identify all the use cases and business opportunities available to the business. This is where hands-on experience can help the business move quickly from inception to fully actionable stories. Breaking the problem down into smaller parts helps the business understand its use cases more deeply.

Proof of Concept 

One important concept in tech-related initiatives is the proof of concept (POC). This is a quick way to create a small-scale AI/ML project without having all the bells and whistles. The aim is to prove that the final project will achieve the expected value on a tight budget and most importantly, in a short time. 

It is better to fail quickly in order to succeed sooner. Data science projects are naturally iterative, which helps the business focus on evaluating the data and the AI/ML models. A POC enables the business to decide at an early stage whether AI/ML in production would deliver the desired value and justify the investment. Measuring the performance of a POC solution can also improve ROI estimates for future investments.

Steps to Measure the ROI of AI projects

The process to determine the ROI of AI projects isn’t always so simple. To start, we have to count the costs incurred from infrastructure (on-premise or cloud), such as hardware, software, power consumption, and licensing. In addition, we have to count the cost of processing, storing, and managing large amounts of data.

Finally, employee compensation depends substantially on the complexity of the project. At a minimum, you need one of each: a project manager, a data engineer, and a data scientist/machine learning engineer. The number of employees, and subsequently the cost, increases with the complexity of the project. The figure below shows the data and model life-cycle. The more data obtained, the better the model and the higher the usage, which typically translates to higher revenue. However, the cost also increases because of the additional storage and data management.

A circular graphic that has 3 parts: "More usage, more data, smarter model" with "higher costs and higher revenue" located on opposite sides of the circle.

Estimating the ROI

In this section, we’ll dive deeper into the topic. Machine learning enables businesses to automate many of their manually performed tasks. When applying AI algorithms such as forecasting, classification, or clustering, the aim is to save time and allow employees to focus on more relevant tasks, for example improving customer retention, delivering better quality of service, and minimizing the mistakes that come from juggling multiple tasks in a fast-paced environment.

Most AI algorithms include various ways to measure how well the algorithm predicts the response variable (the target that we are trying to predict, e.g., hospital readmission). Classification accuracy is a metric that is frequently used.

Classification accuracy is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions to the total number of input samples. The formula below estimates the profit per prediction:

Where â denotes the adjusted saving (profit per prediction), a represents the expected saving, I is the computed average accuracy (we get that from training a model), and e is the cost of manually fixing a mistake.

To get the adjusted savings â, we have to weigh the share of incorrect predictions (1 – accuracy) against the cost of making a mistake. The adjusted savings give us the actual savings after the cost of mistakes is removed. However, the simplicity of the above equation comes with a high risk, as we rely entirely on the performance of the algorithm.
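One reading of the profit-per-prediction formula that is consistent with the symbol definitions above is â = a · I − (1 − I) · e. This form is an assumption reconstructed from the description, and the values of a and e below are hypothetical. A minimal sketch:

```python
def adjusted_saving(a: float, accuracy: float, e: float) -> float:
    """Assumed form of the simple formula: a * I - (1 - I) * e.

    a        -- expected saving per correct prediction (e.g., minutes)
    accuracy -- average model accuracy I, between 0 and 1
    e        -- cost of manually fixing one mistake (same unit as a)
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return a * accuracy - (1.0 - accuracy) * e


# Hypothetical inputs: 2 minutes saved per claim, 10 minutes to fix a mistake.
print(f"{adjusted_saving(a=2.0, accuracy=0.90, e=10.0):.2f} minutes per prediction")
```

Because every prediction is trusted, a small drop in accuracy is multiplied by the full cost of a mistake, which is the risk the paragraph above points out.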

What if the algorithm could report the confidence of each prediction to reduce the risk? It turns out that the majority of algorithms support this. The idea is to trust the highest-confidence predictions for both the positive and negative classes, and to manually process the most uncertain predictions, which naturally sit around a predicted probability of 50 percent.

It may sound awkward to set aside a certain share of predictions for manual evaluation, but fixing a mistake after prediction is generally more expensive than manually processing the few predictions in the high-uncertainty segment (more on this in the next section).

The steps to achieve a more robust estimation algorithm are just slightly more involved than the previous equation. These are the steps:

  • Select the split-threshold χ that divides the high- and low-confidence predictions
  • Run the algorithm and get predictions
  • Compute the confidence score for each prediction
  • Filter out entries that are most uncertain and most costly in a way that satisfies the split-threshold χ
  • Split the predictions based on the user-defined split-threshold
  • From the remaining dataset, compute the confidence accuracy score: Î
  • Apply the equation below to get the adjusted savings: â
A complex formula

Where â denotes the adjusted saving (profit per prediction), a represents the expected saving, Î is the computed average confidence accuracy, e is the cost of manually fixing a mistake, χ is the user-defined split-threshold, and ê is the cost of manual review.

The above equation returns the adjusted savings â after removing low confidence predictions and adding in the cost of manual review, which is influenced by the split-threshold: (1 – χ).
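A form of the confidence-aware equation that matches this description is â = χ · (a · Î − (1 − Î) · e) − (1 − χ) · ê, where the first term covers the trusted predictions and the second the manual review. This is an assumed reconstruction from the symbol definitions, not a quotation, and the inputs below are hypothetical.

```python
def adjusted_saving_with_review(a: float, conf_accuracy: float, e: float,
                                chi: float, review_cost: float) -> float:
    """Assumed form: chi * (a * I_hat - (1 - I_hat) * e) - (1 - chi) * e_hat.

    chi           -- share of predictions that is trusted (split-threshold)
    conf_accuracy -- accuracy I_hat measured on the trusted share only
    review_cost   -- cost e_hat of manually reviewing one prediction
    """
    if not 0.0 <= chi <= 1.0:
        raise ValueError("chi must be between 0 and 1")
    trusted_part = chi * (a * conf_accuracy - (1.0 - conf_accuracy) * e)
    review_part = (1.0 - chi) * review_cost
    return trusted_part - review_part


# Hypothetical inputs: 90/10 split, 95 percent accuracy on the trusted share.
saving = adjusted_saving_with_review(a=2.0, conf_accuracy=0.95, e=10.0,
                                     chi=0.9, review_cost=1.0)
print(f"{saving:.2f} minutes per prediction")
```

With chi set to 1 the review term vanishes and the function reduces to the simple accuracy-only formula.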

A natural question is how to set the split-threshold χ. That depends on how accurate the algorithm is. If your algorithm were perfectly accurate (100 percent accuracy), no predictions would need manual review, so the part of the above equation highlighted in red is not needed and can be omitted. Note that the part highlighted in green then yields the same result as the first equation.

In reality, the accuracy is usually much lower, and for that reason we need to split the predictions into two segments. The lower the accuracy the algorithm returns, the more manual review we need and the smaller the share of predictions we can trust (the part highlighted in green). In this example, we will use a 90/10 split ratio, meaning 90 percent of the predictions will be trusted and 10 percent will be manually reviewed.
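To get a feel for how the split ratio affects the result, the sketch below sweeps a few values of χ on synthetic scores. It assumes the confidence-aware form â = χ · (a · Î − (1 − Î) · e) − (1 − χ) · ê and uses the distance of each probability from 0.5 as the confidence score; the data, the cost values, and the function name are all hypothetical.

```python
import numpy as np


def saving_for_threshold(proba, y_true, chi, a, e, review_cost):
    """Trust the chi most confident share, review the rest, return (saving, accuracy)."""
    proba = np.asarray(proba, dtype=float)
    y_true = np.asarray(y_true)
    n_keep = int(round(chi * len(proba)))
    # Confidence = distance from 0.5; sort from most to least confident.
    order = np.argsort(-np.abs(proba - 0.5), kind="stable")
    keep = order[:n_keep]
    pred = (proba[keep] >= 0.5).astype(int)
    acc = float(np.mean(pred == y_true[keep])) if n_keep else 0.0
    saving = chi * (a * acc - (1.0 - acc) * e) - (1.0 - chi) * review_cost
    return saving, acc


# Synthetic labels and noisy scores, for illustration only.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
p = np.clip(0.2 + 0.6 * y + rng.normal(0.0, 0.25, size=2000), 0.0, 1.0)

for chi in (1.0, 0.95, 0.9, 0.8):
    saving, acc = saving_for_threshold(p, y, chi, a=2.0, e=10.0, review_cost=1.0)
    print(f"chi={chi:.2f}  accuracy={acc:.3f}  saving={saving:.3f}")
```

Comparing the printed rows shows the trade-off: a smaller trusted share usually raises the accuracy on that share but adds review cost.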

So, at what accuracy can we expect concrete profits? By setting the adjusted savings â to zero and solving the above equation for the accuracy, we get the break-even accuracy.

The above equation returns the break-even accuracy as an average percentage; any accuracy above it will yield a tangible saving.
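As a worked sketch of that step, assume the confidence-aware form â = χ · (a · Î − (1 − Î) · e) − (1 − χ) · ê. Setting â = 0 and solving for Î gives Î = (e + (1 − χ) · ê / χ) / (a + e), which reduces to e / (a + e) when every prediction is trusted. Both the assumed equation and the inputs below are illustrative, not taken from the original figures.

```python
def break_even_accuracy(a: float, e: float, chi: float = 1.0,
                        review_cost: float = 0.0) -> float:
    """Accuracy at which the assumed adjusted saving is exactly zero."""
    if not 0.0 < chi <= 1.0:
        raise ValueError("chi must be in (0, 1]")
    if a + e <= 0:
        raise ValueError("a + e must be positive")
    return (e + (1.0 - chi) * review_cost / chi) / (a + e)


# Hypothetical inputs: 2 minutes saved, 10 minutes per mistake, 1 minute per review.
print(f"no review:   {break_even_accuracy(a=2.0, e=10.0):.3f}")
print(f"90/10 split: {break_even_accuracy(a=2.0, e=10.0, chi=0.9, review_cost=1.0):.3f}")
```

The more expensive a mistake is relative to the saving, the closer the break-even accuracy moves toward 100 percent.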

The above equations help us estimate the ROI. Admittedly, they do not account for the initial costs of an AI initiative: processing, data handling, and people. However, they can help us weigh the accuracy against the development cost. The equations return the time saved by the AI model and, as we all know, time is money!

Time is worth different amounts for different organizations and needs to be converted to capital. Only then can we determine the payback period from the initial investment and the recurring costs. The next section goes over a practical example to illustrate the above equations.
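A minimal sketch of that conversion is shown below: minutes saved per prediction become a monthly monetary benefit, and the payback period is the initial investment divided by the net monthly benefit. The hourly rate, volumes, and cost figures are hypothetical placeholders.

```python
import math


def payback_period_months(initial_cost: float,
                          minutes_saved_per_prediction: float,
                          predictions_per_month: float,
                          hourly_rate: float,
                          recurring_cost_per_month: float) -> float:
    """Months needed for the net monthly benefit to repay the initial cost."""
    hours_saved = minutes_saved_per_prediction * predictions_per_month / 60.0
    net_benefit = hours_saved * hourly_rate - recurring_cost_per_month
    if net_benefit <= 0:
        return math.inf  # the investment is never repaid
    return initial_cost / net_benefit


# Hypothetical inputs, for illustration only.
months = payback_period_months(initial_cost=120_000,
                               minutes_saved_per_prediction=0.8,
                               predictions_per_month=100_000,
                               hourly_rate=45.0,
                               recurring_cost_per_month=10_000)
print(f"payback period: {months:.1f} months")
```

Returning infinity when the recurring costs exceed the benefit makes the "never pays back" case explicit instead of producing a negative number of months.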

Real-World Examples

It should come as no surprise that the broad adoption of AI has spread ML algorithms into nearly every sector of our lives, including but not limited to: employment, healthcare, entertainment, transportation, insurance, and marketing.

Customer retention is one of the primary growth pillars for products with a subscription-based business model. Customer churn is a tough problem to tackle in a market where the customers have plenty of providers to choose from. No algorithm will be able to predict churn with 100 percent accuracy, so there will always be a tradeoff between precision and recall. Now let’s consider precision and recall as they relate to churn.

  • Precision – Of all the customers that the algorithm predicts will churn, how many of them actually do churn?
  • Recall – What percentage of customers that churned does the algorithm successfully find?
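The two definitions above can be sketched in a few lines of plain Python. The labels below are made up, with 1 meaning "churned" and 0 meaning "stayed".

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example: 4 customers churned, the model flagged 3 customers.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f}  recall={recall:.2f}")
```

Here two of the three flagged customers really churned (precision), while only two of the four churners were found (recall).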

It is evident that both precision and recall are important for evaluating the performance of a churn prediction algorithm. Imagine a situation where precision is low and a re-engagement campaign is sent to happy customers. That would be less than ideal, as we would like to send it only to the customers who are actually churning. On the other hand, if the campaign is a rebate meant to entice churning customers, it is less of a concern when happy customers receive it too.

In that case, recall can be favored over precision. This is where ROI estimation helps businesses regulate and optimize the balance between precision and recall. Another crucial issue is the cost of hospital readmission. In this section, we will go through the details of estimating the ROI for hospital readmission.

Hospital Readmission

The cost of hospital readmission accounts for a large portion of hospital inpatient services spending. Diabetes is not only one of the top 10 leading causes of death in the world but also the most expensive chronic disease in the United States. 

A graphic showing the hospital exit on one side and the entrance on the other with a patient and doctor in the middle.

Hospitalized patients with diabetes are at higher risk of readmission than other patients. Therefore, reducing readmission rates for diabetic patients has great potential to reduce medical costs significantly. For the example in this blog post, we use the dataset obtained from the Center for Machine Learning and Intelligent Systems at The University of California, Irvine, which contains over 100,000 records and 50 features. The dataset can be found on Kaggle.

The dataset contains protected health information (PHI), see table below.

A sample diabetes dataset from the Center for Machine Learning and Intelligent Systems at The University of California, Irvine

And Non-PHI columns:

A sample diabetes dataset from the Center for Machine Learning and Intelligent Systems at The University of California, Irvine that shows non-PHI columns.

We start by selecting a machine learning algorithm. Any classification library can be used. We opted to use the XGBoost classification model. We use 91589 rows for training and 10177 rows for testing. The code below shows the algorithm parameters and the method to split the data into training and testing sets. Then, it trains the model and computes the performance report.

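from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hyperparameters for the XGBoost classifier
xgb_params = {'n_estimators': 2000,
              'max_depth': 9,
              'learning_rate': 0.0201,
              'reg_lambda': 29.326,
              'subsample': 0.818,
              'use_label_encoder': False,
              'colsample_bytree': 0.235,
              'colsample_bynode': 0.820,
              'colsample_bylevel': 0.453}

alg = XGBClassifier(**xgb_params)
# Hold out 10 percent of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
# fit_model is a helper (not shown in this post) that trains the model and prints the performance report
accuracy_mlp, auc_mlp, m = fit_model(alg, X_train, X_test, y_train, y_test, reports=True)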
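The `fit_model` helper is not shown in the post. A hypothetical stand-in with the same call signature might look like the sketch below; it assumes a binary target and a scikit-learn style estimator, and it uses `LogisticRegression` on synthetic data only to keep the demo small. It is not the author's implementation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split


def fit_model(alg, X_train, X_test, y_train, y_test, reports=False):
    """Hypothetical stand-in: fit, score on the test split, return (accuracy, auc, model)."""
    model = alg.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    accuracy = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_proba)
    if reports:
        print(classification_report(y_test, y_pred))
        print(f"accuracy={accuracy:.3f}  auc={auc:.3f}")
    return accuracy, auc, model


# Tiny synthetic demo; LogisticRegression stands in for XGBClassifier.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.1, random_state=1)
demo_accuracy, demo_auc, demo_model = fit_model(
    LogisticRegression(max_iter=1000), Xtr, Xte, ytr, yte, reports=True)
```

Any estimator exposing `fit`, `predict`, and `predict_proba` can be passed in the same way.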

The above code yields the prediction results from the performance report. 

Prediction results from the diabetes dataset

The accuracy of our model is an impressive 86 percent (see the Figure above). If we use the equation without the confidence scores with these made-up assumptions:

Then we can expect to save 0.099 minutes of work per claim. By processing 10177 claims, we can save 17 hours of work. Not bad! After all, the accuracy is 86 percent. 

Equation with Confidence Scores

Before we dive into the next example, we have to clarify the predictions that come out of XGBoost. The XGBoost classification model can directly predict the label (i.e., hospital readmission) for a given observation. Alternatively, the model can predict the probability of an observation belonging to each possible class label, which provides the flexibility to set a threshold on the prediction uncertainty.

Depending on the model used, e.g., complex nonlinear ML algorithms, the predicted probabilities may not match the expected distribution of observed probabilities in the training data because of the use of approximations.

This issue can be solved by adjusting the probabilities to better match the expected distribution observed in the data. This capability is referred to as calibration.

A calibrated model produces predicted probabilities that match the expected distribution of probabilities for each class. The code below helps calibrate the probabilities.

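from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from xgboost import XGBClassifier

def calibrated(trainX, testX, trainy):
    # define and fit the base model
    model = XGBClassifier(**xgb_params)
    model.fit(trainX, trainy)
    # wrap the already-fitted model in a sigmoid (Platt scaling) calibrator
    calibrator = CalibratedClassifierCV(model, method='sigmoid', cv="prefit")
    calibrator.fit(trainX, trainy)

    # predict probabilities of the positive class
    return calibrator.predict_proba(testX)[:, 1], calibrator

# calibrated predictions
yhat_calibrated, mod = calibrated(X_train, X_test, y_train)
# observed fraction of positives and mean predicted probability per bin
fop_calibrated, mpv_calibrated = calibration_curve(y_test, yhat_calibrated, n_bins=10)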


A reliability diagram is a line plot of the relative frequency of what was observed (y-axis) versus the predicted probability (x-axis). The predicted probabilities are divided into a fixed number of buckets along the x-axis. The number of events (class=1) is then counted for each bin, and the counts are normalized into observed frequencies. See the reliability diagrams below.
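The binning behind a reliability diagram can be sketched by hand with NumPy. The function below is an illustrative equivalent of the counting described above, using equal-width bins and skipping empty ones; the function name and the sample values are hypothetical.

```python
import numpy as np


def reliability_curve(y_true, y_prob, n_bins=10):
    """Return (observed frequency, mean predicted probability) per non-empty bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index of each prediction, using the inner edges of equal-width bins.
    idx = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    positives = np.bincount(idx, weights=y_true, minlength=n_bins)
    prob_sum = np.bincount(idx, weights=y_prob, minlength=n_bins)
    nonempty = counts > 0
    observed = positives[nonempty] / counts[nonempty]
    mean_predicted = prob_sum[nonempty] / counts[nonempty]
    return observed, mean_predicted


# Made-up predictions, for illustration only.
obs, pred = reliability_curve([0, 0, 1, 1], [0.05, 0.15, 0.15, 0.95])
for o, p in zip(obs, pred):
    print(f"mean predicted={p:.2f}  observed frequency={o:.2f}")
```

Plotting the observed frequency against the mean predicted probability, with the diagonal as reference, gives the diagram described above.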

A diagram showcasing reliability

The blue line represents the typical S-shaped curve of an uncalibrated model with conservative predictions against the dashed line that represents the perfectly calibrated model. The orange line represents the calibrated probabilities. The calibrated model fits the dashed line much better than the uncalibrated model, although still over-forecasting in the upper quadrant as the probabilities are below the diagonal line, meaning the probabilities are too large.   

Now let’s look at the equation that considers confidence scores with 90/10 confidence split. We use the same made-up assumptions from the previous example with some additional variables.

The code below filters out the 10 percent of predictions with the highest uncertainty, centered on a 50 percent probability. We use the calibrated model from above to compute the accuracy on the remaining 90 percent of claims.

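import numpy as np

XTest = X_test.copy()
XTest['confidence'] = yhat_calibrated
XTest['y'] = y_test

def remove_uncertainty(x, percentage=10.0, uncertainty_level=0.50):
    # work on a copy so the caller's DataFrame is left untouched
    x = x.copy()
    # rescale each probability to its distance from the uncertainty level (0 = most uncertain)
    x['confidence'] = np.abs((x['confidence'] - uncertainty_level)/uncertainty_level)
    # keep everything except the `percentage` percent least confident rows
    count = x.shape[0] - int(len(x)*(percentage/100))
    return x.nlargest(count, ['confidence'])

gData = remove_uncertainty(XTest)

YTest = gData.pop('y')
confidence = gData.pop('confidence')
XTest = gData

# pre_fit_model is a helper (not shown in this post) that evaluates the already-fitted model
pre_fit_model(mod, XTest, YTest, reports=True)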

We get an impressive 90 percent accuracy in the performance report below.

Prediction results

This blog post shows a simple method to filter out a small percentage of the high-uncertainty claims for manual review. Finding the optimal algorithm is out of scope for this blog. Instead, we focused on the value of filtering out a few claims with high uncertainty in order to reduce the cost of fixing mistakes.

The table below shows the testing dataset with a combined confidence column. 

Incorporating the confidence scores provides a 7X improvement. With every prediction, we can expect to save 0.85 minutes of work. Processing 10177 claims, we can save 144.2 hours of work, even after accounting for the cost of mistakes and of manually reviewing 10 percent of the predictions.
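The conversion from minutes per prediction to total hours is simple arithmetic. The short sketch below applies it to the per-claim savings and the claim count quoted in this post; the helper name is hypothetical.

```python
def hours_saved(minutes_per_prediction: float, n_predictions: int) -> float:
    """Total hours saved across all predictions."""
    return minutes_per_prediction * n_predictions / 60.0


n_claims = 10177
print(round(hours_saved(0.099, n_claims), 1))  # without confidence scores: 16.8
print(round(hours_saved(0.85, n_claims), 1))   # with the 90/10 confidence split: 144.2
```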

With just a 4.7 percent improvement in accuracy, we can achieve impressive outcomes. The above equation offers remarkable flexibility. We don’t need to blindly trust the ML model’s predictions; instead, we have a way to mathematically regulate the predictions with high uncertainty.

We can also see from the above equation the break-even accuracy is at 87 percent. In our case, we obtained a 90 percent accuracy, which gives us concrete profits.  


Organizations that do simple ROI calculations for an AI project often fail to consider the uncertainty associated with realizing the benefits. 

A complicating factor with AI models is that they make errors, meaning their accuracy is almost always less than 100 percent. As a result, we need to estimate both the savings and the cost of making mistakes. In order to compute the savings, we need to compare a baseline of human performance against the AI model’s performance. Also, since the real world is messier than a training environment, any errors could be more pronounced in production.

We hope that we were able to shed some light on this crucial topic. High accuracy is golden, but at phData, we always advise our customers not to rush after it. In any investment, the return should be more than the cost, and the extra accuracy may not justify the investment needed to pursue it.

In this post, we have armed you with the knowledge to estimate the ROI. Hands-on experience is essential for a successful AI project.

At phData, we have experience solving tough machine learning problems and putting robust solutions into production. If you’d like to leverage our insights in your AI initiatives, don’t hesitate to reach out!       
