DBSCAN Clustering
Clustering is a powerful unsupervised learning technique. Its purpose is to identify groups within data that has no predefined labels. I have learned about methods like K-Means Clustering and Hierarchical Clustering, so I decided to take this opportunity to dig deeper and learn something that is new, at least to me. In this post I am going to explore the mechanics behind DBSCAN clustering, another powerful technique.
What is DBSCAN Clustering?
DBSCAN is an acronym that stands for Density-Based Spatial Clustering of Applications with Noise. The key to understanding it is the phrase ‘density-based’: data points are clustered into groups based on the density of the points surrounding them.
What parameters are necessary to perform DBSCAN?
For this algorithm, two parameters must be specified before you can proceed: epsilon and minimum samples. Epsilon is the radius used to search for neighboring points; if a point falls within the epsilon radius of another point, it belongs to that point’s neighborhood. Minimum samples is the minimum number of points a neighborhood must contain. This will all be clearer in a visual below, but first, let’s talk about the data points.
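To make epsilon a bit more concrete, here is a small sketch (separate from DBSCAN itself) that uses scikit-learn’s NearestNeighbors to count how many points fall within a chosen radius of each point; the radius of 1.0 and the toy data are just illustrative assumptions.
# Sketch: counting how many points fall inside each point's epsilon radius
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(centers=2, random_state=42)

eps = 1.0  # illustrative radius; this is the value later passed to DBSCAN as eps
neighbors = NearestNeighbors(radius=eps).fit(X)
_, indices = neighbors.radius_neighbors(X)

# Each entry counts the points (including the point itself) inside the radius
neighborhood_sizes = np.array([len(idx) for idx in indices])
print(neighborhood_sizes[:10])
If a point’s neighborhood size meets the minimum samples, it will end up being one of the core points described next.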
What are the three types of data points in DBSCAN?
With DBSCAN’s method, your data consists of three types of points: core, border, and outlier points. A core point has at least the minimum number of samples within its neighborhood. Every point is tested with the same epsilon, so not all points will meet the minimum samples. A border point does not meet the minimum samples itself, but it lies in the neighborhood of a core point; you know a cluster is coming to an end when border points are found. Finally, an outlier neither meets the minimum samples within its own neighborhood nor lies in the neighborhood of any core point.
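As a rough sketch of how these three types show up in scikit-learn (the eps and min_samples values here are arbitrary placeholders): a fitted DBSCAN object exposes core_sample_indices_ for the core points and labels outliers as -1 in labels_, so whatever is left over must be a border point.
# Sketch: separating core, border, and outlier points after fitting DBSCAN
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, _ = make_blobs(centers=2, random_state=42)
db = DBSCAN(eps=1, min_samples=2).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # core points
outlier_mask = db.labels_ == -1             # outliers are labeled -1
border_mask = ~core_mask & ~outlier_mask    # in a cluster, but not a core point

print(core_mask.sum(), border_mask.sum(), outlier_mask.sum())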
Comparison to K-Means and Hierarchical
Unlike K-Means, we do not specify the number of clusters we want. However, it is important to note that the radius we choose can have an effect on the number of clusters formed. Hierarchical Clustering tends to be sensitive to outliers; that is not the case with DBSCAN, since outliers are accounted for in its clustering process. So, if you don’t want to specify the number of clusters and you know outliers are present, DBSCAN is another method to use.
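For instance, here is a quick sketch (the eps values are chosen arbitrarily) showing how the radius changes the number of clusters DBSCAN finds on the same toy data:
# Sketch: the number of clusters DBSCAN finds depends on the epsilon radius
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, _ = make_blobs(centers=2, random_state=42)

for eps in (0.3, 0.5, 1.0, 2.0):  # arbitrary radii for illustration
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks outliers
    n_outliers = list(labels).count(-1)
    print(f"eps={eps}: {n_clusters} clusters, {n_outliers} outliers")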
How do I do this in Python?
Thanks to scikit-learn, DBSCAN is easy to implement:
# importing libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# load your data -- I am creating random data for this
X, Y = make_blobs(centers=2)

# import DBSCAN & instantiate it with parameters
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=1, min_samples=2)
clusters = dbscan.fit_predict(X)

# plot it
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap="plasma")
plt.show()
In Conclusion:
DBSCAN is another clustering technique used in unsupervised learning. With this technique we do not specify the number of clusters, nor do we need to worry about the presence of outliers. Just remember to specify an epsilon and minimum samples, and then you can let scikit-learn do the work for you.
References:
This is a really great video → https://www.youtube.com/watch?v=sJQHz97sCZ0
https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf
https://www.analyticsvidhya.com/blog/2020/09/how-dbscan-clustering-works/