import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Clustering
Content
- Introduction
- What is Clustering
- Why Clustering
- Types of Clustering Methods/Algorithms
Introduction
Clustering is essentially an unsupervised learning technique. Unsupervised learning draws inferences from datasets consisting of input data without labelled responses. It is commonly used to discover meaningful structure, underlying explanatory processes, generative properties, and groupings inherent in a set of examples.
What is Clustering
Clustering is the process of dividing a population or set of data points into groups so that data points in the same group are more similar to one another than to data points in other groups. In essence, it groups objects based on their similarity and dissimilarity.
As an unsupervised machine learning approach, clustering groups comparable data points together based on particular attributes or features. It seeks natural groupings or patterns within a dataset without requiring labelled output.
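To make the idea concrete, here is a minimal sketch using scikit-learn's KMeans. The toy 2-D points and the choice of two clusters are arbitrary, made up purely for illustration; nearby points end up with the same cluster label.
import numpy as np
from sklearn.cluster import KMeans

# toy 2-D points: two loose groups around (0, 0) and (5, 5) -- invented data for illustration
points = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.4],
                   [5.1, 5.0], [4.8, 5.2], [5.3, 4.9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1] -- similar points share a label (numbering is arbitrary)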
Why Clustering
When working with very large datasets, dividing the data into logical groupings, termed clusters, is an effective way to analyse them. In this way you can extract value from a large quantity of unstructured data: it lets you quickly scan the data for patterns or structure before digging deeper for specific results.
Clustering is important because it determines the intrinsic grouping of unlabelled data. There are no universal criteria for a good clustering; it is up to the user to decide which criteria will satisfy their needs. Some common motivations:
- Pattern Recognition: Clustering helps identify natural groups or patterns within data that may not be immediately evident, enabling the discovery of inherent structures and relationships.
- Data Exploration and Understanding: Clustering reveals how the data points in a collection are distributed and categorised, which is useful for exploratory data analysis and for understanding the underlying structure.
- Segmentation and Customer Profiling: In business and marketing, clustering is often used to segment customers. It helps identify groups of customers with similar behaviour, preferences, or buying habits, which enables targeted marketing strategies.
- Anomaly Detection: Clustering can be used to identify unusual patterns or outliers in a dataset; data points that do not fit well into any cluster can be treated as outliers (a minimal sketch follows this list).
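As a rough sketch of the outlier idea, DBSCAN assigns the label -1 to points that do not belong to any dense cluster. The data and parameters below are invented for illustration and are not part of this notebook's dataset.
import numpy as np
from sklearn.cluster import DBSCAN

# two tight groups plus one far-away point -- illustrative values only
pts = np.array([[1.0, 1.1], [1.1, 0.9], [0.9, 1.0],
                [8.0, 8.1], [8.1, 7.9], [7.9, 8.0],
                [20.0, 20.0]])  # the last point is far from both groups

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(pts)
print(labels)  # points labelled -1 are treated as noise/outliers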
Types of Clustering Methods/Algorithms
| Clustering Method/Algorithm | Method | Description | Advantages | Disadvantages |
|---|---|---|---|---|
| K-Means Clustering | Partitioning | Divides the dataset into a specified number (k) of clusters. | Simple and computationally efficient. | Sensitive to initial cluster centroids. |
| Hierarchical Clustering | Agglomerative (bottom-up) or divisive (top-down) | Builds a tree-like hierarchy of clusters. | Provides a hierarchy of clusters, visualised with a dendrogram. | Computationally more intensive. |
| DBSCAN | Density-based | Forms clusters based on the density of data points. | Can discover clusters of arbitrary shapes and handles noise well. | Sensitive to parameter settings. |
| Mean Shift | Centroid-based | Iteratively shifts centroids towards the mode of the data distribution; the resulting clusters are regions of high data density. | Can find irregularly shaped clusters and adapt to varying densities. | Computationally expensive. |
| Gaussian Mixture Model (GMM) | Probabilistic | Assumes the data points are generated from a mixture of several Gaussian distributions. | Can model complex data distributions and provides probabilistic cluster assignments. | Sensitive to the initial parameters. |
| Agglomerative Clustering | Hierarchical, bottom-up | Starts with individual data points as separate clusters and iteratively merges the closest clusters until a stopping criterion is met. | Can handle different shapes and sizes of clusters. | Computationally more intensive. |
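The scikit-learn estimators below correspond to several methods in the table. This is only a sketch on synthetic blob data (make_blobs); the parameters are arbitrary placeholders and would need tuning on real data. The rest of this notebook applies K-Means to an IMDB TV-show dataset.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN, MeanShift
from sklearn.mixture import GaussianMixture

# synthetic data with 3 blobs, used only to compare the estimators' interfaces
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)

models = {
    'K-Means': KMeans(n_clusters=3, n_init=10, random_state=42),
    'Agglomerative': AgglomerativeClustering(n_clusters=3),
    'DBSCAN': DBSCAN(eps=0.8, min_samples=5),
    'Mean Shift': MeanShift(),
    'GMM': GaussianMixture(n_components=3, random_state=42),
}

for name, model in models.items():
    labels = model.fit_predict(X_demo)  # each of these estimators exposes fit_predict
    print(name, 'found', len(set(labels) - {-1}), 'clusters')  # -1 marks DBSCAN noise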
df = pd.read_csv('IMDB.csv')
df.head()
|   | Name | Year | Episodes | Type | Rating | Image-src | Description | Name-href |
|---|---|---|---|---|---|---|---|---|
| 0 | 1. Breaking Bad | 2008–2013 | 62 eps | TV-MA | 9.5 | https://m.media-amazon.com/images/M/MV5BYmQ4YW... | A chemistry teacher diagnosed with inoperable ... | https://www.imdb.com/title/tt0903747/?ref_=cht... |
| 1 | 2. Planet Earth II | 2016 | 6 eps | TV-G | 9.5 | https://m.media-amazon.com/images/M/MV5BMGZmYm... | David Attenborough returns with a new wildlife... | https://www.imdb.com/title/tt5491994/?ref_=cht... |
| 2 | 3. Planet Earth | 2006 | 11 eps | TV-PG | 9.4 | https://m.media-amazon.com/images/M/MV5BMzMyYj... | A documentary series on the wildlife found on ... | https://www.imdb.com/title/tt0795176/?ref_=cht... |
| 3 | 4. Band of Brothers | 2001 | 10 eps | TV-MA | 9.4 | https://m.media-amazon.com/images/M/MV5BMTI3OD... | The story of Easy Company of the U.S. Army 101... | https://www.imdb.com/title/tt0185906/?ref_=cht... |
| 4 | 5. Chernobyl | 2019 | 5 eps | TV-MA | 9.4 | https://m.media-amazon.com/images/M/MV5BNTdkN2... | In April 1986, an explosion at the Chernobyl n... | https://www.imdb.com/title/tt7366338/?ref_=cht... |
df.columns
Index(['Name', 'Year', 'Episodes', 'Type', 'Rating', 'Image-src',
       'Description', 'Name-href'],
      dtype='object')
df['Year']
0 2008–2013
1 2016
2 2006
3 2001
4 2019
...
245 2009–
246 2002–2015
247 2009–2013
248 2014–2015
249 2017–2019
Name: Year, Length: 250, dtype: object
df.dropna(inplace=True)
df.isnull().sum()
Name 0
Year 0
Episodes 0
Type 0
Rating 0
Image-src 0
Description 0
Name-href 0
dtype: int64
plt.figure(figsize=(10, 6))
sns.countplot(x='Type', data=df, palette='viridis')
plt.title('Distribution of TV Shows by Type')
plt.show()
top_rated_shows = df.sort_values(by='Rating', ascending=False).head(10)
plt.figure(figsize=(12, 6))
sns.barplot(x='Rating', y='Name', data=top_rated_shows, palette='Blues_r')
plt.title('Top-rated TV Shows and Their IMDb Ratings')
plt.xlabel('IMDb Rating')
plt.ylabel('TV Show Name')
plt.show()
plt.figure(figsize=(12, 6))
sns.swarmplot(x='Type', y='Rating', data=df, palette='dark', size=8)
plt.title('IMDb Ratings Distribution by TV Show Type')
plt.xlabel('TV Show Type')
plt.ylabel('IMDb Rating')
plt.show()
sns.pairplot(df[['Year', 'Episodes', 'Rating', 'Type']], hue='Type', palette='Set1')
plt.suptitle('Pair Plot of TV Show Data with Type Hue', y=1.02)
plt.show()
fig, ax = plt.subplots(1, figsize = (30,8))
ax = sns.scatterplot(x='Year', y='Rating', data=df, hue='Type', palette='Set1', alpha=0.7)
ax.grid()
fig.autofmt_xdate()
plt.xticks(rotation = 90, ha = 'right',
fontsize = 10)
plt.xlim(0, 178)
plt.title('Correlation between Release Year and IMDb Ratings')
plt.xlabel('Release Year')
plt.ylabel('IMDb Rating')
plt.legend(title='TV Show Type')
plt.show()
from wordcloud import WordCloud
import matplotlib.pyplot as plt
top_rated_descriptions = " ".join(df['Description'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(top_rated_descriptions)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Top-rated TV Show Descriptions')
plt.show()
numerical_features = df[['Year', 'Episodes', 'Rating']]
df['Year'] = df['Year'].astype(str)
# keep only the start year from ranges like "2008–2013"
df.loc[:, 'Year'] = df['Year'].str.split('–').str[0].astype(int)
# extract the episode count from strings like "62 eps"
df.loc[:, 'Episodes'] = pd.to_numeric(df['Episodes'].str.extract(r'(\d+)')[0], errors='coerce')
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
features = df[['Year', 'Episodes', 'Rating']]
features = features.dropna()
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
num_clusters = 3
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)  # n_init set explicitly (the sklearn default is changing)
df['Cluster'] = kmeans.fit_predict(features_scaled)
print(df[['Name', 'Cluster']])
plt.scatter(features_scaled[:, 0], features_scaled[:, 1], c=df['Cluster'], cmap='viridis')
plt.xlabel('Year')
plt.ylabel('Episodes')
plt.title('K-Means Clustering')
plt.show()
Name Cluster
0 1. Breaking Bad 1
1 2. Planet Earth II 1
2 3. Planet Earth 1
3 4. Band of Brothers 1
4 5. Chernobyl 1
.. ... ...
240 241. Gintama 2
241 242. Queer Eye 0
242 243. The Angry Video Game Nerd 2
243 244. Alfred Hitchcock Presents 2
244 245. The Night Of 0
[245 rows x 2 columns]

X = df[['Year', 'Episodes', 'Rating']].dropna()
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss, marker='o')
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
num_clusters = 4
clusterer = KMeans(n_clusters=num_clusters, n_init=10, random_state=10)
cluster_labels = clusterer.fit_predict(X)
print("Cluster Labels:")
print(cluster_labels)
df['Cluster'] = cluster_labels
print("Data with Cluster Labels:")
print(df[['Name', 'Year', 'Episodes', 'Rating', 'Cluster']])
Cluster Labels:
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1
0 1 0 0 0 0 0 0 0 3 0 0 3 0 0 0 1 1 0 0 1 0 1 1 0 0 0 0 0 0 2 0 0 3 0 0 3
0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1 0
0 3 0 0 0 1 0 0 0 0 1 0 0 3 0 1 0 1 1 3 0 0 0 0 0 0 0 0 0 3 0 0 0 1 0 1 0
0 0 1 1 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 3 0 0 1 1 0 1 0 1 0 0 0 3 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1
0 0 0 0 3 0 0 0 1 1 0 0 0 0 1 0 0 0 3 0 1 3 0]
Data with Cluster Labels:
Name Year Episodes Rating Cluster
0 1. Breaking Bad 2008 62 9.5 0
1 2. Planet Earth II 2016 6 9.5 0
2 3. Planet Earth 2006 11 9.4 0
3 4. Band of Brothers 2001 10 9.4 0
4 5. Chernobyl 2019 5 9.4 0
.. ... ... ... ... ...
240 241. Gintama 2005 375 8.7 3
241 242. Queer Eye 2018 60 8.5 0
242 243. The Angry Video Game Nerd 2004 225 8.5 1
243 244. Alfred Hitchcock Presents 1955 268 8.5 3
244 245. The Night Of 2016 8 8.4 0
[245 rows x 5 columns]
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

range_n_clusters = [2, 3, 4, 5, 6]
for n_clusters in range_n_clusters:
    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.set_xlim([-0.1, 1])
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])
    clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=10)
    cluster_labels = clusterer.fit_predict(X)
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
    sample_silhouette_values = silhouette_samples(X, cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10
    ax1.set_title("The silhouette plot")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X.iloc[:, 0], X.iloc[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')
    centers = clusterer.cluster_centers_
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')
    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
plt.show()
For n_clusters = 2 The average silhouette_score is : 0.7841282051476728
For n_clusters = 3 The average silhouette_score is : 0.7443994662641097
For n_clusters = 4 The average silhouette_score is : 0.6346534773168201
For n_clusters = 5 The average silhouette_score is : 0.5596258728689598
For n_clusters = 6 The average silhouette_score is : 0.5557296368153766




