Understanding K-means Clustering in Machine Learning

Priyansh Kushwah
3 min readJul 18, 2021

K-means clustering is one of the simplest and popular unsupervised machine learning algorithms.

AndreyBu, who has more than 5 years of machine learning experience and currently teaches people his skills, says that “the objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.”

Now we will understand this with the help of a beautiful figure.

A cluster refers to a collection of data points aggregated together because of certain similarities.

Every data point is allocated to each of the clusters through reducing the in-cluster sum of squares.

In other words, the K-means algorithm identifies k number of centroids, and then allocates every data point to the nearest cluster, while keeping the centroids as small as possible.

The ‘means’ in the K-means refers to averaging of the data; that is, finding the centroid.

K-means algorithm example problem

Let’s see the steps on how the K-means machine learning algorithm works using the Python programming language.

We’ll use the Scikit-learn library and some random data to illustrate a K-means clustering simple explanation.

Step 1: Import libraries

import pandas as pdimport numpy as npimport matplotlib.pyplot as pltfrom sklearn.cluster import KMeans
  • Pandas for reading and writing spreadsheets
  • Numpy for carrying out efficient computations
  • Matplotlib for visualization of data .

# Data input

Here is how the data is displayed on a two-dimensional space:

Step 3: Use Scikit-Learn

In this case, we arbitrarily gave k (n_clusters) an arbitrary value of two.

Real world use cases of K-means clustering in Security domain

1) Intrusion Detection System (IDS)

Any malicious activity or violation is typically reported or collected centrally using a security information and event management system. Anomaly detection is one of intrusion detection system. Current anomaly detection is often associated with high false alarm with moderate accuracy and detection rates when it’s unable to detect all types of attacks correctly.

2) Malware Detection

Clustering detection model by using K-Means clustering approach to detect malware behavior of data based on the features of the malware. Clustering techniques that use unsupervised algorithm in machine learning plays an important role in grouping similar malware characteristics by studying the behavior of the malware which results in, model is capable to cluster normal and suspicious data into two separate groups with high detection rate which is more than 90 percent accuracy.

--

--