How To Handle Imbalanced Data in Classification

Classification is a Machine Learning task that is often used to solve critical business problems. Be it predicting customer churn or fraudulent transactions, ML helps businesses take the right actions at the right time. 

Solving problems with real-world data is rarely straightforward because it often involves imbalanced classes. Ignoring the imbalance leads to a poorly performing machine learning model that negatively impacts business decisions.

In this blog, we introduce the classification task and the imbalanced data set problem associated with it. We also summarize several considerations to take into account while building a machine learning (ML) model with such data.

This article will be helpful to beginner data scientists and product managers who are trying to solve classification tasks.

What is Classification?

Classification is a machine learning task where you categorize the data into a set of predefined classes. If the target you are predicting has only two classes, then it is a binary classification problem and if the target has more than two classes, it is a multi-class classification problem.

For example, categorizing an email as spam or not spam is a binary classification task while classifying a vehicle’s image as SUV, sedan, van, etc. is a multiclass classification task. 

There is another classification type known as multilabel classification where the target variable accepts multiple values out of predetermined classes. Predicting multiple tags associated with a StackOverflow question is an example of a multilabel classification problem. 

What is Imbalanced Data and Why is it Important to Handle it?

To build a machine learning (ML) model, we need training data so that the model can learn from it. ML models learn and generalize well when they are trained on sufficient data representing all the classes involved. However, this training data is often collected from real-world applications where not all classes are equally represented.

In some cases, the available data is heavily skewed toward only one class, leading to an imbalanced data problem.

To elaborate, in the case of fraud detection, data scientists use historical data to build a binary classification model. In practice, one would find very few transactions that represent a fraud case while the majority would be normal transactions.

When a model is trained directly on this data, the model performance will be poor as the model is overexposed to non-fraudulent transaction data. Worse yet, the standard accuracy metric is likely to be high because a model that always predicts cases to be non-fraudulent would be correct for the vast majority of cases.
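
The accuracy trap described above can be illustrated with a small sketch. The numbers here (1,000 transactions, 10 of them fraudulent) are made up for illustration and do not come from a real data set.

```python
# Hypothetical data: 1,000 transactions, 10 of which are fraud (label 1).
y_true = [1] * 10 + [0] * 990

# A "model" that always predicts the majority class (non-fraud).
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: frauds caught / all frauds.
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = caught / sum(y_true)

print(f"accuracy     = {accuracy:.2%}")      # 99.00%
print(f"fraud recall = {fraud_recall:.2%}")  # 0.00%
```

The model looks excellent on accuracy while catching no fraud at all, which is why accuracy alone is misleading here.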

It is thus very important to handle the imbalanced data problem.

How to Handle Imbalanced Data?

In this section, we’ll summarize how to deal with the imbalanced data at each of the model-building steps.

Data Sampling

Real-world data can be massive with millions of rows and hundreds of features (columns) associated with data points. With huge amounts of data, using 100 percent of the available data for training and testing can be inefficient. Hence, we sample the data while also dealing with the imbalanced class problem. You can use the following techniques:

Random Undersampling: In this method, we randomly remove the data points related to the majority class. This leads to fewer examples of the majority class in the training data and makes our data closer to balanced. Typically, the distributions are made closer to a 90:10 split between the majority and minority classes.
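
As a rough sketch of random undersampling, the snippet below uses only the Python standard library. The function name `random_undersample` and its `target_majority_share` parameter are hypothetical helpers invented for this example, and the data is synthetic.

```python
import random

def random_undersample(rows, labels, majority_label, target_majority_share, seed=0):
    """Randomly drop majority rows until they make up about
    target_majority_share of the data (hypothetical helper)."""
    if not 0 < target_majority_share < 1:
        raise ValueError("target_majority_share must be between 0 and 1")
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    minority_idx = [i for i, y in enumerate(labels) if y != majority_label]
    # Solve keep / (keep + n_minority) = share for the number of rows to keep.
    keep = round(len(minority_idx) * target_majority_share / (1 - target_majority_share))
    keep = min(keep, len(majority_idx))  # never keep more than we have
    kept_majority = rng.sample(majority_idx, keep)  # sampling without replacement
    selected = sorted(kept_majority + minority_idx)
    return [rows[i] for i in selected], [labels[i] for i in selected]

# Synthetic example: 0.5 percent minority class (label 1).
rows = list(range(10_000))
labels = [1] * 50 + [0] * 9_950
sampled_rows, sampled_labels = random_undersample(
    rows, labels, majority_label=0, target_majority_share=0.9
)
print(len(sampled_rows), sum(sampled_labels))  # 500 rows, 50 of them minority
```

All minority rows are kept; only majority rows are discarded, so the split moves from 99.5:0.5 to 90:10.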

Random Oversampling: In this method, we randomly duplicate the data points related to the minority class. While duplicating the minority class, it is important to check the data quality of that class. If a junk data point from the minority class is replicated multiple times in our training set, it will have an impact on model performance.
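
A minimal sketch of random oversampling follows, again with the standard library only. `random_oversample` and `n_extra` are hypothetical names chosen for this example. The duplicate counter at the end is one simple way to see which minority rows were replicated, so their quality can be inspected.

```python
import random
from collections import Counter

def random_oversample(rows, labels, minority_label, n_extra, seed=0):
    """Append n_extra randomly chosen copies of minority rows (hypothetical helper)."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    if not minority_idx:
        raise ValueError("no rows with the minority label")
    extra = rng.choices(minority_idx, k=n_extra)  # sampling with replacement
    new_rows = rows + [rows[i] for i in extra]
    new_labels = labels + [labels[i] for i in extra]
    return new_rows, new_labels, Counter(extra)

# Synthetic example: two minority rows ("e" and "f") out of six.
rows = ["a", "b", "c", "d", "e", "f"]
labels = [0, 0, 0, 0, 1, 1]
new_rows, new_labels, duplicates = random_oversample(rows, labels, minority_label=1, n_extra=2)

print(len(new_rows), sum(new_labels))  # 8 rows, 4 of them minority
print(duplicates)  # how many times each original row index was copied
```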

Synthetic Data Creation: In this method, we create additional data points of the minority class by considering the existing data points of the minority class. The most common technique used for data creation is Synthetic Minority Oversampling Technique (SMOTE). SMOTE works by finding the nearest neighbors of a randomly selected data point from the minority class, then randomly selecting one of the neighbors and placing a new synthetic data point between the neighbor and the original data point. This leads to more examples of the minority class in our training set.
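
The interpolation step described above can be sketched in a few lines. This is a simplified illustration of the idea, not the reference SMOTE implementation: the function name `smote_sketch` is made up, distances are plain Euclidean, and all features are assumed to be numeric. In practice, a maintained implementation such as the `SMOTE` class in the imbalanced-learn package is usually a better choice.

```python
import math
import random

def smote_sketch(minority_points, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbors."""
    if len(minority_points) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    k = min(k, len(minority_points) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority_points))
        base = minority_points[i]
        # k nearest minority neighbors of the base point (excluding itself).
        others = [p for j, p in enumerate(minority_points) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Synthetic two-feature minority class.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.5, 2.5)]
new_points = smote_sketch(minority, n_new=3, k=2)
print(new_points)
```

Because each new point lies between two existing minority points, it always stays inside the range already covered by the minority class.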

Model Training

Data scientists can use some of the model parameters to improve ML model performance when dealing with imbalanced classes as mentioned below:

Class Weights: Similar to the scale_pos_weight parameter in the XGBoost model, scikit-learn offers the `class_weight` parameter in most of its classification models. Adjusting class weights allows you to heavily penalize mistakes on a certain class, thus putting more emphasis on that class. If nothing is specified, all classes are given an equal weight of 1. The `balanced` option is a good way to start as it assigns weights to the classes in inverse proportion to their counts. This more heavily penalizes minority class errors.
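
To make the `balanced` option concrete, the sketch below reproduces the formula scikit-learn documents for it, n_samples / (n_classes * class count), in plain Python. The helper name `balanced_class_weights` and the 900:100 data are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights in inverse proportion to class counts:
    n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Synthetic labels: 900 majority (0) and 100 minority (1) examples.
labels = [0] * 900 + [1] * 100
weights = balanced_class_weights(labels)
print(weights)  # class 0 gets about 0.56, class 1 gets 5.0

# With scikit-learn you would normally not compute this by hand; passing
# class_weight="balanced" (or a dict like the one above) to a classifier
# that supports the parameter has the same effect.
```

Here a mistake on a minority example costs nine times as much as a mistake on a majority example, matching the 9:1 imbalance.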

scale_pos_weight: This is a parameter in the XGBoost model that scales up errors made on the positive class (e.g., the minority class or class 1), which pushes the model to work harder at correcting them. This can lead to better performance on the positive class. A typical value for this parameter is the negative class count divided by the positive class count.
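
The typical value mentioned above is a simple ratio. The helper name `suggested_scale_pos_weight` and the data below are hypothetical.

```python
def suggested_scale_pos_weight(labels, positive_label=1):
    """Negative class count divided by positive class count."""
    n_pos = sum(1 for y in labels if y == positive_label)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive examples in labels")
    return n_neg / n_pos

# Synthetic labels: 990 negatives and 10 positives.
labels = [0] * 990 + [1] * 10
ratio = suggested_scale_pos_weight(labels)
print(ratio)  # 99.0

# The value would then be passed to XGBoost, for example as
# xgboost.XGBClassifier(scale_pos_weight=ratio).
```

Treat the ratio as a starting point and tune it against a validation set rather than as a fixed rule.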

Model Evaluation

Checking the Test Data Distribution: Usually, we select a time period of data and split it into training and testing data randomly using a proportion (say 80 percent for training). It is important to check the distribution of the classes in the test data after the split. What if your test data has only 1 data point for the minority class? The evaluation metrics will be very skewed with such a split. In such cases, increase the data available for the test split to have more coverage for the minority class.
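
One way to run this check is sketched below. Note that the stratified split is a swapped-in alternative to the purely random split described above: it samples the same fraction from each class so the minority class cannot vanish from the test set by chance. The helper name `stratified_split` and the threshold `MIN_MINORITY_IN_TEST` are assumptions made for this example, not established rules. scikit-learn offers a similar option through the `stratify` argument of `train_test_split`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), taking test_fraction of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))  # at least one row per class
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Synthetic labels: 990 majority (0) and 10 minority (1) examples.
labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

# Inspect the class distribution of the test split.
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)  # 198 majority rows and 2 minority rows

# Hypothetical threshold: flag splits with too few minority rows to evaluate on.
MIN_MINORITY_IN_TEST = 30
needs_more_minority = test_counts[1] < MIN_MINORITY_IN_TEST
print(needs_more_minority)  # True
```

With only two minority rows in the test split, a single misclassification swings recall by 50 points, which is the kind of skew the check is meant to catch.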

Using the Right Metrics: Accuracy is almost never the right metric for imbalanced classification problems. Use precision and recall on the minority class if predicting that class correctly is important: precision tells you how many of the flagged cases are truly positive, and recall tells you how many of the true positives were caught. If you are comparing different classification models, use the precision-recall AUC to decide the best model, as the precision-recall curve focuses on the minority class.
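
The sketch below computes these metrics by hand on a tiny made-up example. The precision-recall AUC is approximated with average precision, a step-wise sum that is one common way to summarize the curve. This version assumes there are no tied scores. The function names are illustrative.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (minority) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(y_true, scores, positive=1):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(t == positive for t in y_true)
    tp = 0
    total = 0.0
    for rank, (_, t) in enumerate(ranked, start=1):
        if t == positive:
            tp += 1
            total += tp / rank  # precision at the rank of each true positive
    return total / n_pos if n_pos else 0.0

# Made-up labels and model scores: 3 positives out of 10.
y_true = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
scores = [0.1, 0.2, 0.9, 0.3, 0.4, 0.8, 0.05, 0.15, 0.7, 0.6]

# Hard predictions at a 0.5 threshold.
y_pred = [1 if s >= 0.5 else 0 for s in scores]
precision, recall = precision_recall(y_true, y_pred)
ap = average_precision(y_true, scores)

print(f"precision = {precision:.3f}")  # 0.500
print(f"recall    = {recall:.3f}")     # 0.667
print(f"avg. precision = {ap:.3f}")    # 0.756
```

Precision and recall depend on the chosen threshold, while average precision summarizes the ranking across all thresholds, which makes it better suited to comparing models.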

Post-Processing Predictions: Since the data sampling stage alters the class distribution on which the model is trained, the predicted probabilities need to be calibrated so that they stay close to the original distribution of the classes. This step involves rescaling the predictions using the original class distributions.
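
The exact adjustment depends on how the data was sampled. As one possible sketch, the snippet below applies a commonly used correction for the case where only the majority class was randomly undersampled and the minority class is the positive class. This is an assumption about the setup, not necessarily the method the text has in mind. The function name `correct_for_undersampling` and the parameter `beta` (the fraction of majority rows kept) are introduced here for illustration.

```python
def correct_for_undersampling(p_sampled, beta):
    """Map a probability from a model trained on undersampled data back to
    the original class balance.

    p_sampled: predicted probability of the positive (minority) class.
    beta: fraction of majority-class rows kept during undersampling.
    """
    if not 0 < beta <= 1:
        raise ValueError("beta must be in (0, 1]")
    return beta * p_sampled / (beta * p_sampled - p_sampled + 1)

# Suppose only 10 percent of the majority rows were kept for training.
beta = 0.1
raw_predictions = [0.05, 0.5, 0.9]
calibrated = [correct_for_undersampling(p, beta) for p in raw_predictions]
print(calibrated)
```

The corrected probabilities are lower than the raw ones because the model saw the positive class far more often during training than it will in production. With `beta` equal to 1 (no undersampling), the function returns the prediction unchanged.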

Practical Tips

  1. The SMOTE technique presented in the data sampling section can be computationally intensive. Also, validate the data generated by SMOTE against your business rules. For example, a feature describing the “number of transactions” cannot be negative, so synthetic values for it should not be negative either.
  2. Random undersampling is the most commonly used technique as it is fast to implement and provides good results in most cases.

Conclusion

Imbalanced class problems are a common scenario in real-world classification use cases. If not handled correctly, imbalanced class data can lead to a poorly performing machine learning model and ultimately impact the business use cases.

Need help unlocking more value from your machine learning initiatives? Contact the ML experts at phData today for advice, questions, and strategy!
