K-means Clustering: Meaning, Applications & Example
An unsupervised algorithm that groups data into k clusters.
What is K-means Clustering?
K-means clustering is a popular unsupervised machine learning algorithm for partitioning a dataset into distinct groups, or clusters. It assigns each data point to one of \(K\) clusters, namely the cluster with the nearest mean (centroid). K-means aims to minimize the variance within each cluster, so that data points in the same cluster are as similar to one another as possible. It does not guarantee a globally optimal partition, and the result can depend on how the centroids are initialized.
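The "variance within each cluster" can be written as a single objective. The following is a sketch of the usual formulation; the symbols \(J\), \(C_k\), \(\mu_k\) and \(x_i\) are notation introduced here for illustration and do not appear elsewhere in this article:

\[
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
\]

Here \(C_k\) is the set of points assigned to cluster \(k\) and \(\mu_k\) is its centroid. A smaller \(J\) means the points sit closer to their own centroids.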
Key Concepts of K-means Clustering
- Centroids: Each cluster is represented by a centroid, which is the mean of all data points in the cluster. The centroids are recalculated during the algorithm’s iterations to improve the clustering.
- Clusters: The data points are grouped into \(K\) clusters. The number \(K\) must be predefined before running the algorithm.
- Assignment Step: Each data point is assigned to the nearest centroid based on a distance metric, typically Euclidean distance.
- Update Step: After the assignment, the centroid of each cluster is recalculated as the mean of all the points assigned to that cluster.
- Convergence: The algorithm repeats the assignment and update steps until the centroids no longer change or the maximum number of iterations is reached.
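The assignment, update, and convergence steps above can be sketched in plain NumPy. This is a minimal illustration, not a replacement for a library implementation: the function name `kmeans_sketch`, its parameters (`max_iter`, `tol`, `seed`), the choice to initialize centroids from randomly picked data points, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np


def kmeans_sketch(X, k, max_iter=100, tol=1e-6, seed=0):
    """Illustrative K-means loop; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct points drawn at random from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance from every point
        # to every centroid, then pick the nearest centroid per point.
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)

        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)

        # Convergence: stop once the centroids have effectively stopped moving.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    return centroids, labels


# Two well-separated synthetic blobs, used only for illustration.
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
centroids_demo, labels_demo = kmeans_sketch(X_demo, k=2)
```

The loop stops either when the centroids stop moving (within `tol`) or when `max_iter` is reached, matching the two stopping conditions described above.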
Applications of K-means Clustering
- Customer Segmentation: K-means is commonly used in marketing to group customers based on purchasing behavior or demographic data, allowing for targeted marketing strategies.
- Image Compression: In image processing, K-means can be used to reduce the number of colors in an image, making it easier to store or transmit.
- Document Clustering: K-means is used to group similar documents together based on text features, facilitating tasks like topic modeling or organizing large collections of text.
- Anomaly Detection: In some cases, K-means can help identify outliers or anomalies by clustering normal data and treating points that don’t fit any cluster well as anomalies.
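As one illustration of these applications, the anomaly-detection idea can be sketched with scikit-learn by measuring how far a point lies from its nearest centroid. The synthetic data, the candidate points, and the 99th-percentile cutoff are assumptions chosen for the example; a real threshold would need to be tuned to the data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic "normal" data: two tight blobs (illustrative only).
normal = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(100, 2)),
    rng.normal(loc=4.0, scale=0.3, size=(100, 2)),
])

# Cluster the normal data only.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(normal)

# transform() gives the distance from each point to every centroid;
# the row-wise minimum is the distance to the nearest centroid.
train_dist = kmeans.transform(normal).min(axis=1)

# Hypothetical cutoff: the 99th percentile of distances seen in normal data.
threshold = np.percentile(train_dist, 99)

# Score new points: two near the blobs, two far from both.
candidates = np.array([[0.1, -0.2], [4.1, 3.9], [10.0, 10.0], [2.0, -8.0]])
is_anomaly = kmeans.transform(candidates).min(axis=1) > threshold
```

In this setup, the two candidates that sit far from both blobs exceed the cutoff and are flagged, while the two close to a centroid are not.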
Example of K-means Clustering
An example of how K-means clustering might be applied using Python’s scikit-learn library:
# Importing necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
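# Generating 100 random 2-D points (seeded so the run is reproducible)
rng = np.random.default_rng(0)
X = rng.random((100, 2))
# Applying K-means with 3 clusters (n_init and random_state set for repeatable results)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)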
kmeans.fit(X)
# Getting the cluster centroids
centroids = kmeans.cluster_centers_
# Getting the labels (cluster assignments)
labels = kmeans.labels_
# Plotting the data points and centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200)
plt.title('K-means Clustering')
plt.show()
In this example, 100 data points are randomly generated, and K-means is applied to group them into 3 clusters. The centroids of the clusters are displayed as red ‘X’ markers.
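The plot shows how the algorithm partitions the data and how close each point lies to its centroid. Because the points are drawn uniformly at random, the data has no natural cluster structure, so the three groups are simply a partition of the plotted region.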
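The example above fixes \(K\) at 3, but \(K\) has to be chosen before the algorithm runs. One common heuristic is to fit the model for several values of \(K\) and compare the within-cluster sum of squares, which scikit-learn exposes as `inertia_`. The sketch below assumes the same kind of random data as the example; the range of \(K\) values tried is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 2))  # 100 random 2-D points, as in the example

# Fit K-means for K = 1..6 and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"K={k}: inertia={value:.3f}")
```

Inertia generally falls as \(K\) grows, so the lowest value is not by itself the best choice. A usual approach is to look for the point where adding another cluster stops giving a large improvement.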