Data Balancing Act
To ensure success for the day, nutritionists recommend eating a balanced breakfast. To ensure a successful classification model, data scientists recommend that your dataset be balanced. This means your dataset should have approximately equal representation of the categories/classes present. If you train a classification algorithm on imbalanced data, the model can be compromised when introduced to new data: it will assume a typical dataset looks like the imbalanced one it was trained on and will most likely make tons of errors. Not to worry, there are methods we can use to combat this imbalance, and I am going to cover those methods here.
In order to achieve a balanced dataset, we should consider resampling. There are two major ways to do this: upsampling and downsampling.
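Before resampling anything, it is worth checking how skewed your classes actually are. Here is a minimal sketch, assuming a hypothetical pandas DataFrame df with a 'target' column (the same setup used in the snippets below), with made-up counts purely for illustration:
import pandas as pd

# toy data: 90 'majority' rows and 10 'minority' rows (hypothetical numbers)
df = pd.DataFrame({'target': ['majority'] * 90 + ['minority'] * 10})

# count how many rows belong to each class
print(df['target'].value_counts())

# the same counts as proportions, to see the skew at a glance
print(df['target'].value_counts(normalize=True))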
Downsampling (or Under-sampling)
‘Under-sampling balances the dataset by reducing the size of the abundant class.’ — as explained by Kdnuggets.com
When one class dominates, a way to achieve balance is to discard observations from the majority class. This creates a more equitable distribution of the classes in the data. It is only recommended when you have a large amount of data to spare; generally speaking, the larger the dataset, the better this works.
Upsampling (or Over-sampling)
Another way to achieve balance in a majority-dominated dataset is to increase the size of the minority class. Keep in mind that this is the better choice if you do not have a lot of data to begin with. It will not only create equity between the classes, but it will also increase your overall sample size.
Both of these techniques are very easy to implement in Python:
minority = df[df['target'] == 'minority']
majority = df[df['target'] == 'majority']

# the resample utility in sklearn.utils can handle both directions
from sklearn.utils import resample

# DOWNSAMPLING/UNDERSAMPLING: shrink the majority class to the minority's size
downsample_data = resample(majority,
                           replace=False,            # sample without replacement
                           n_samples=len(minority),  # match number in minority class
                           random_state=23)          # reproducible results

# UPSAMPLING/OVERSAMPLING: grow the minority class to the majority's size
upsample_data = resample(minority,
                         replace=True,               # sample with replacement
                         n_samples=len(majority),    # match number in majority class
                         random_state=23)            # reproducible results
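Whichever direction you resample, the result then needs to be recombined with the untouched class to form the balanced training set. A quick sketch, continuing the hypothetical df, minority, and majority from above:
import pandas as pd

# downsampled version: shrunken majority + original minority
balanced_down = pd.concat([downsample_data, minority])

# upsampled version: original majority + grown minority
balanced_up = pd.concat([majority, upsample_data])

# both should now show roughly equal class counts
print(balanced_down['target'].value_counts())
print(balanced_up['target'].value_counts())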
Tomek-links
Tomek-links is a way to go about undersampling. It is more calculated than traditional undersampling done with random resampling. The process has a similar quality to a nearest-neighbor algorithm: it looks for pairs of points from opposite classes that are each other's nearest neighbors, and removes the majority-class point from each pair.
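To make the pairing idea concrete, here is a rough sketch of how such mutual nearest-neighbor pairs could be found with scikit-learn. This only illustrates the definition; it is not how the imblearn implementation works internally:
import numpy as np
from sklearn.neighbors import NearestNeighbors

def find_tomek_links(X, y):
    # nearest neighbor of each point (n_neighbors=2 because a point's closest match is itself)
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    nearest = idx[:, 1]
    links = []
    for i, j in enumerate(nearest):
        # a Tomek link is a mutual nearest-neighbor pair with opposite labels
        if nearest[j] == i and y[i] != y[j] and i < j:
            links.append((i, j))
    return links

# toy data: one minority point sitting right next to a majority point
X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0]])
y = np.array([1, 0, 0, 0, 0])
print(find_tomek_links(X, y))  # undersampling would then drop the majority member of each pair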
SMOTE
SMOTE, or Synthetic Minority Over-sampling Technique, is another way to go about oversampling. Again, it is more calculated than traditional oversampling. It works by randomly choosing points from the minority class, computing their k-nearest neighbors (also in the minority class), and adding synthetic points in between. In my opinion, this gives a more realistic approach to oversampling.
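The interpolation step itself is simple: a synthetic sample is just a random step from a minority point toward one of its minority-class neighbors. A minimal sketch of that idea (not the actual imblearn implementation):
import numpy as np

rng = np.random.default_rng(23)

def synthetic_point(x, neighbor):
    # place a new sample somewhere along the line segment between x and its neighbor
    step = rng.random()
    return x + step * (neighbor - x)

x = np.array([1.0, 2.0])          # a minority-class point
neighbor = np.array([2.0, 3.0])   # one of its minority-class nearest neighbors
print(synthetic_point(x, neighbor))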
These two techniques described above have their own classes in the imblearn module:
# it is important to remember whether the technique you are using requires oversampling or undersampling!
from imblearn.under_sampling import TomekLinks
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TOMEK-LINKS (undersampling)
tl = TomekLinks()
X_tl, y_tl = tl.fit_resample(X_train, y_train)

# SMOTE (oversampling)
sm = SMOTE(random_state=42)
X_sm, y_sm = sm.fit_resample(X_train, y_train)
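To confirm the resampling did what you expect, print the class counts before and after; Counter from the standard library works nicely here (continuing the snippet above):
from collections import Counter

print('original:   ', Counter(y_train))
print('tomek-links:', Counter(y_tl))
print('smote:      ', Counter(y_sm))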
These two methods are so simple to perform in code, and the user does not even have to identify the majority or minority class. This makes the process even easier for anyone to implement.
Wrapping it up, it is crucial to balance your dataset for most classification algorithms; if you don't, there can be serious consequences for your results. Thankfully, there are ways to handle this. If you have a lot of data and some to spare, undersampling is a great option. If your dataset is rather small, it is better to oversample the minority class, which not only makes your dataset larger but also balances your classes. Even more, I would recommend going with Tomek-links and/or SMOTE, which take a more strategic approach to deleting from the majority or adding to the minority.
References:
Upsampling and Downsampling — https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html
SMOTE Documentation — https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html
Tomek-links Documentation — http://glemaitre.github.io/imbalanced-learn/auto_examples/under-sampling/plot_tomek_links.html