Aim: Implementation of K-Means Clustering

Program:


import numpy as np
from sklearn.datasets import make_blobs

class KMeans:
    def __init__(self, n_clusters, max_iters=100):
        self.n_clusters = n_clusters
        self.max_iters = max_iters

    def fit(self, X):
        # Initialize centroids randomly
        self.centroids = X[np.random.choice(X.shape[0], self.n_clusters, replace=False)]

        for _ in range(self.max_iters):
            # Assign each data point to the nearest centroid
            labels = self._assign_labels(X)
            # Update centroids
            new_centroids = self._update_centroids(X, labels)

            # Check for convergence: stop when the centroids no longer change
            if np.allclose(self.centroids, new_centroids):
                break
            self.centroids = new_centroids

    def _assign_labels(self, X):
        # Compute distances from each data point to centroids
        distances = np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)

        # Assign labels based on the nearest centroid
        return np.argmin(distances, axis=1)

    def _update_centroids(self, X, labels):
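        # Note: if a cluster ends up with no assigned points, its mean is NaN;
        # a more robust implementation would re-initialise such a centroid.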
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(self.n_clusters)])
        return new_centroids

# Generate synthetic data using sklearn.datasets.make_blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Create a K-Means instance with 3 clusters
kmeans = KMeans(n_clusters=3)

# Fit the model to the data
kmeans.fit(X)

# Get cluster assignments for each data point
labels = kmeans._assign_labels(X)

print("Cluster Assignments:", labels)
print("Final Centroids:", kmeans.centroids)
    
Output:

Cluster Assignments:
[0 0 2 2 0 2 2 2 2 2 2 2 2 2 0 2 0 2 2 2 2 2 2 0 2 1 0 2 2 2 2 2 0 2 0 2 0 2 0 2 2 2 0 2 2 2 0 2 0 2 2 1 0 2 1 2 0 2 2 2 1 2 2 1 1 2 2 0 0 2 2 0 0 2 2 1 0 2 2 2 2 2 0 2 2 0 1 2 2 2 0 2 1 2 2 0 0 2 0 1 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 0 2 0 2 2 2 2 1 0 0 1 2 1 0 2 2 2 2 2 2 2 0 2 0 2 2 1 2 2 2 2 2 2 2 2 1 2 0 2 2 2 0 0 2 2 0 1 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 2 2 0 2 2 2 2 1 1 2 0 2 2 2 2 2 1 2 2 2 2 2 2 0 0 2 2 2 0 2 2 0 1 1 1 2 0 0 2 0 0 2 2 0 1 2 2 2 1 2 0 2 2 0 2 1 0 1 2 2 2 1 2 2 2 0 2 0 2 0 2 2 0 2 2 0 2 2 2 0 2 2 2 1 2 2 2 2 2 2 2 1 1 1 2 2 2 2 2 2 2 0 2 1 2 2 1 2 2 2 2 0 2 2 2 1 1]

Final Centroids:
[[-7.36858243 -7.40836171]    # Centroid for Cluster 0
 [-6.05855368 -6.26139533]    # Centroid for Cluster 1
 [ 1.05693535  5.52708203]]   # Centroid for Cluster 2
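
An optional sketch (assuming matplotlib is installed) can be used to visually verify the clusters and the final centroids produced by the program above:

import matplotlib.pyplot as plt

# Scatter plot of the data coloured by cluster assignment,
# with the final centroids marked as red crosses
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=20)
plt.scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1], c='red', marker='x', s=100)
plt.title('K-Means Clustering Result')
plt.show()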

Sample Viva Questions:

1. What is K-Means clustering?

K-Means clustering is one of the most popular unsupervised machine learning algorithms. It partitions an unlabelled dataset into K groups (clusters), assigning each data point to the cluster whose centroid is nearest, and is used to find intrinsic groups within the data and draw inferences from them.

2. Explain the basic steps of the K-Means algorithm.

The K-Means algorithm is relatively simple yet effective. The basic steps are as follows (a condensed NumPy sketch of steps 3 and 4 is given after the list):

  • Choose the number of clusters, k.
  • Initialize k cluster centroids (for example, by picking k random data points).
  • Assign each data point to the nearest centroid, creating k clusters.
  • Recalculate each centroid as the mean of all data points assigned to its cluster.
  • Repeat steps 3 and 4 until convergence (centroids no longer change significantly) or for a specified number of iterations.
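
A condensed sketch of steps 3 and 4 in plain NumPy (the same logic as the _assign_labels and _update_centroids methods in the program above; one_iteration is just an illustrative helper name):

import numpy as np

# One K-Means iteration: X is an (n_samples, n_features) array,
# centroids is a (k, n_features) array.
def one_iteration(X, centroids):
    # Step 3: assign each point to its nearest centroid
    distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
    labels = np.argmin(distances, axis=1)
    # Step 4: recompute each centroid as the mean of its assigned points
    new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(len(centroids))])
    return labels, new_centroids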

3. What are the applications of K-Means clustering?

K-Means clustering is one of the most common unsupervised machine learning algorithms. It is widely used in many applications, including:

  • Image segmentation
  • Customer segmentation
  • Species clustering
  • Anomaly detection
  • Clustering languages

4. Explain the intuition behind K-Means clustering.

K-Means is a centroid-based clustering algorithm: it finds intrinsic groups within an unlabelled dataset, represents each group by a centroid, and draws inferences from the resulting partition.

Centroid: a centroid is the data point at the centre of a cluster. In centroid-based clustering, each cluster is represented by its centroid, and the notion of similarity is derived from how close a data point is to that centroid. K-Means is an iterative algorithm: it takes the number of clusters K and the dataset (a collection of features for each data point) as input, starts with initial estimates for the K centroids, and then alternates between two steps:

Data assignment step: each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance; if c_i is a centroid in the set C, a data point is assigned to the cluster whose centroid has the minimum Euclidean distance to it.

Centroid update step: in this step, the centroids are recomputed and updated by taking the mean of all data points assigned to that centroid's cluster.
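
Written formally (a standard formulation, where S_i denotes the set of points currently assigned to centroid c_i):

    assignment:  \arg\min_{c_i \in C} \lVert x - c_i \rVert^2
    update:      c_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j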

5. What is the objective of K-Means clustering?

K-Means clustering begins with the definition of a cost function over a parameterized set of possible clusterings, and the objective of the algorithm is to find a partitioning (clustering) of minimum cost. Under this model, the clustering task is turned into an optimization problem.
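
As an illustrative sketch (assuming the program above has been run, so X, kmeans and labels are already defined), the cost being minimised, i.e. the within-cluster sum of squared distances, often called the inertia, can be computed directly:

import numpy as np

# Within-cluster sum of squared distances (the K-Means cost function)
def inertia(X, centroids, labels):
    return sum(np.sum((X[labels == i] - c) ** 2) for i, c in enumerate(centroids))

print("Inertia:", inertia(X, kmeans.centroids, labels))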
