This data set is created only for the learning purpose of the customer segmentation concepts , also known as market basket analysis . I will demonstrate this by using unsupervised ML technique (KMeans Clustering Algorithm) in the simplest form.
Customer segmentation is a process of identifying and describing the different types of customers that exist within a given population. The objective of customer segmentation is to understand each group's needs and behaviors, and to optimize business operations by tailoring the product or service offerings to meet those needs.
The most common technique used to segment customers is clustering. Clustering is a method of grouping customer data so that similar customers are grouped together, while dissimilar ones are placed into separate clusters. K-Means Clustering is an advanced form of clustering, which uses a mathematical model to cluster the data points based on their distance from each other.
import numpy as np
import pandas as pd
from pandas.api.types import is_string_dtype, is_numeric_dtype
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from mpl_toolkits.mplot3d import Axes3D
from yellowbrick.cluster import KElbowVisualizer
from yellowbrick.cluster import SilhouetteVisualizer
import warnings
warnings.filterwarnings('ignore')
sns.set_style('darkgrid')
df = pd.read_csv('Mall_Customers.csv')
df.head()
CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) | |
---|---|---|---|---|---|
0 | 1 | Male | 19 | 15 | 39 |
1 | 2 | Male | 21 | 15 | 81 |
2 | 3 | Female | 20 | 16 | 6 |
3 | 4 | Female | 23 | 16 | 77 |
4 | 5 | Female | 31 | 17 | 40 |
df.shape
(200, 5)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 200 entries, 0 to 199 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CustomerID 200 non-null int64 1 Gender 200 non-null object 2 Age 200 non-null int64 3 Annual Income (k$) 200 non-null int64 4 Spending Score (1-100) 200 non-null int64 dtypes: int64(4), object(1) memory usage: 7.9+ KB
df.isnull().any()
# df.isnull().sum()
# sns.heatmap(df.isnull(),cmap = 'magma',cbar = False);
df.duplicated().sum()
CustomerID False Gender False Age False Annual Income (k$) False Spending Score (1-100) False dtype: bool
0
df.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
CustomerID | 200.0 | 100.50 | 57.879185 | 1.0 | 50.75 | 100.5 | 150.25 | 200.0 |
Age | 200.0 | 38.85 | 13.969007 | 18.0 | 28.75 | 36.0 | 49.00 | 70.0 |
Annual Income (k$) | 200.0 | 60.56 | 26.264721 | 15.0 | 41.50 | 61.5 | 78.00 | 137.0 |
Spending Score (1-100) | 200.0 | 50.20 | 25.823522 | 1.0 | 34.75 | 50.0 | 73.00 | 99.0 |
#Columns's string treatment and drop unused column
df.columns = [s.strip().replace(' ', '_') for s in df.columns]
df = df.rename(columns = {'Annual_Income_(k$)':'Annual_Income', 'Spending_Score_(1-100)':'Spending_Score'})
# df = df.drop(columns = 'CustomerID')
df.drop('CustomerID', axis=1, inplace=True)
df.columns
Index(['Gender', 'Age', 'Annual_Income', 'Spending_Score'], dtype='object')
categorical_features = []
numerical_features = []
for column in df:
if is_numeric_dtype(df[column]):
numerical_features.append(column)
elif is_string_dtype(df[column]):
categorical_features.append(column)
# print('Categorical Features :', *categorical_features)
print('Categorical Features :', categorical_features)
# print('Numerical Features :', *numerical_features)
print('Numerical Features :', numerical_features)
Categorical Features : ['Gender'] Numerical Features : ['Age', 'Annual_Income', 'Spending_Score']
# Make a function to create numeric plots
def create_numeric_plot(columns):
fig, axs = plt.subplots(len(columns), 2, figsize=(9, 8))
for i, col in enumerate(columns):
sns.boxplot(df[col], ax = axs[i][0])
sns.distplot(df[col], ax = axs[i][1])
axs[i][0].set_title('mean = %.2f\n median = %.2f\n std = %.2f'%(df[col].mean(), df[col].median(), df[col].std()))
plt.setp(axs)
plt.tight_layout()
plt.show()
# Call create_numeric_plot function
create_numeric_plot(numerical_features)
adjustable: {'box', 'datalim'} agg_filter: a filter function, which takes a (m, n, 3) float array and a dpi value, and returns a (m, n, 3) array alpha: scalar or None anchor: (float, float) or {'C', 'SW', 'S', 'SE', 'E', 'NE', ...} animated: bool aspect: {'auto', 'equal'} or float autoscale_on: bool autoscalex_on: bool autoscaley_on: bool axes_locator: Callable[[Axes, Renderer], Bbox] axisbelow: bool or 'line' box_aspect: float or None clip_box: `.Bbox` clip_on: bool clip_path: Patch or (Path, Transform) or None facecolor or fc: color figure: `.Figure` frame_on: bool gid: str in_layout: bool label: object navigate: bool navigate_mode: unknown path_effects: `.AbstractPathEffect` picker: None or bool or float or callable position: [left, bottom, width, height] or `~matplotlib.transforms.Bbox` prop_cycle: unknown rasterization_zorder: float or None rasterized: bool sketch_params: (scale: float, length: float, randomness: float) snap: bool or None subplotspec: unknown title: str transform: `.Transform` url: str visible: bool xbound: unknown xlabel: str xlim: (bottom: float, top: float) xmargin: float greater than -0.5 xscale: {"linear", "log", "symlog", "logit", ...} or `.ScaleBase` xticklabels: unknown xticks: unknown ybound: unknown ylabel: str ylim: (bottom: float, top: float) ymargin: float greater than -0.5 yscale: {"linear", "log", "symlog", "logit", ...} or `.ScaleBase` yticklabels: unknown yticks: unknown zorder: float
df.shape
for i in numerical_features:
Q1 = df[i].quantile(0.25)
Q3 = df[i].quantile(0.75)
IQR = Q3 - Q1
df = df[(df[i]<=(Q3+1.5*IQR))&(df[i]>=(Q1-1.5*IQR))]
df = df.reset_index(drop=True)
df.shape
df.head()
(200, 4)
(198, 4)
Gender | Age | Annual_Income | Spending_Score | |
---|---|---|---|---|
0 | Male | 19 | 15 | 39 |
1 | Male | 21 | 15 | 81 |
2 | Female | 20 | 16 | 6 |
3 | Female | 23 | 16 | 77 |
4 | Female | 31 | 17 | 40 |
df['Gender'].value_counts()
Female 112 Male 86 Name: Gender, dtype: int64
df['Gender'].value_counts(normalize=True)
Female 0.565657 Male 0.434343 Name: Gender, dtype: float64
# Make a function to create categorical plots
def create_categorical_plot(columns):
fig, axs = plt.subplots(len(columns), 1, figsize=(5, 5))
for i, col in enumerate(columns):
sns.countplot(df[col], order = df[col].value_counts().head(10).index, ax =axs)
axs.set_title('Countplot '+ col, fontsize = 20)
plt.xticks(rotation = 0)
#Create annotate
for i in axs.patches:
axs.annotate(format(i.get_height(), '.0f'),
(i.get_x() + i.get_width() / 2., i.get_height()),
ha = 'center',
va = 'center',
xytext = (0, 10),
textcoords = 'offset points')
# Setting Plot
sns.despine(right=True,top = True, left = True)
axs.axes.yaxis.set_visible(False)
plt.setp(axs)
plt.tight_layout()
plt.show();
# Call create_categorical_plot function
create_categorical_plot(categorical_features)
adjustable: {'box', 'datalim'} agg_filter: a filter function, which takes a (m, n, 3) float array and a dpi value, and returns a (m, n, 3) array alpha: scalar or None anchor: (float, float) or {'C', 'SW', 'S', 'SE', 'E', 'NE', ...} animated: bool aspect: {'auto', 'equal'} or float autoscale_on: bool autoscalex_on: bool autoscaley_on: bool axes_locator: Callable[[Axes, Renderer], Bbox] axisbelow: bool or 'line' box_aspect: float or None clip_box: `.Bbox` clip_on: bool clip_path: Patch or (Path, Transform) or None facecolor or fc: color figure: `.Figure` frame_on: bool gid: str in_layout: bool label: object navigate: bool navigate_mode: unknown path_effects: `.AbstractPathEffect` picker: None or bool or float or callable position: [left, bottom, width, height] or `~matplotlib.transforms.Bbox` prop_cycle: unknown rasterization_zorder: float or None rasterized: bool sketch_params: (scale: float, length: float, randomness: float) snap: bool or None subplotspec: unknown title: str transform: `.Transform` url: str visible: bool xbound: unknown xlabel: str xlim: (bottom: float, top: float) xmargin: float greater than -0.5 xscale: {"linear", "log", "symlog", "logit", ...} or `.ScaleBase` xticklabels: unknown xticks: unknown ybound: unknown ylabel: str ylim: (bottom: float, top: float) ymargin: float greater than -0.5 yscale: {"linear", "log", "symlog", "logit", ...} or `.ScaleBase` yticklabels: unknown yticks: unknown zorder: float
plt.figure(figsize=(10,10))
sns.pairplot(data = df,hue='Gender',diag_kind='kde')
plt.show();
<Figure size 720x720 with 0 Axes>
ut = np.triu(df.corr())
lt = np.tril(df.corr())
fig,ax = plt.subplots(nrows = 1, ncols = 2,figsize = (15,5))
plt.subplot(1,2,1)
sns.heatmap(df.corr(),cmap = 'magma',annot = True,cbar = 'True',mask = ut);
plt.title('Correlation Matrix : Upper Triangular Format');
plt.subplot(1,2,2)
sns.heatmap(df.corr(),cmap = 'magma',annot = True,cbar = 'True',mask = lt);
plt.title('Correlation Matrix : Lower Triangular Format');
# Spending vs. Age
X = list()
# X.append(df[["Age", "Spending_Score"]].values)
# Spending vs. Age vs. Annual Income
X.append(df[["Age", "Annual_Income", "Spending_Score"]].values)
N = list()
# N.append(list(df[["Age", "Spending_Score"]].columns))
N.append(list(df[["Age", "Annual_Income", "Spending_Score"]].columns))
K = list()
for i in range(len(X)):
model = KMeans(random_state=42)
# visualizer = KElbowVisualizer(model, k=(2,10), metric='silhouette')
visualizer = KElbowVisualizer(model, k=(2, 20))
visualizer.fit(X[i])
visualizer.show()
# print(visualizer.elbow_value_)
K.append(visualizer.elbow_value_)
model = KMeans(n_clusters = visualizer.elbow_value_, random_state=42)
sil_visualizer = SilhouetteVisualizer(model)
sil_visualizer.fit(X[i])
sil_visualizer.show()
plt.show()
KElbowVisualizer(ax=<AxesSubplot:>, estimator=KMeans(n_clusters=19, random_state=42), k=(2, 20))
<AxesSubplot:title={'center':'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
SilhouetteVisualizer(ax=<AxesSubplot:>, estimator=KMeans(n_clusters=6, random_state=42))
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 198 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# https://stackoverflow.com/questions/65325834/using-a-variable-for-the-group-by-method-in-by-python-pandas
def create_all_summary(df,features,column_to_aggregate,agg_method):
df_output = df.groupby(features)[column_to_aggregate].agg(agg_method)
return df_output
# Numerical Data Cluster Visualization
def cluster_num_plot():
# Numerical Data Cluster Visualization
for i in numerical_features:
plt.figure(figsize=(6,4))
ax = sns.boxplot(x = 'cluster',y = i, data = df)
plt.title('\nBox Plot {}\n'.format(i), fontsize=15)
plt.show();
def cluster_cat_plot():
# Categorical Data Cluster Visualization
for i in categorical_features:
plt.figure(figsize=(9,7))
ax = sns.countplot(data = df, x = 'cluster')
plt.title('\nCount Plot {}\n'.format(i), fontsize=15)
# ax.legend(loc="upper center")
for p in ax.patches:
ax.annotate(format(p.get_height(), '.0f'),
(p.get_x() + p.get_width() / 2., p.get_height()),
ha = 'center',
va = 'center',
xytext = (0, 10),
textcoords = 'offset points')
sns.despine(right=True,top = True, left = True)
ax.axes.yaxis.set_visible(False)
plt.show();
for i in categorical_features:
plt.figure(figsize=(9,7))
ax = sns.countplot(data = df, x = 'cluster', hue = i )
plt.title('\nCount Plot {}\n'.format(i), fontsize=15)
ax.legend(loc="upper center")
for p in ax.patches:
ax.annotate(format(p.get_height(), '.0f'),
(p.get_x() + p.get_width() / 2., p.get_height()),
ha = 'center',
va = 'center',
xytext = (0, 10),
textcoords = 'offset points')
sns.despine(right=True,top = True, left = True)
ax.axes.yaxis.set_visible(False)
plt.show();
for i in range(len(X)):
kmeans = KMeans(n_clusters = K[i], init = 'k-means++', random_state = 42)
kmeans.fit(X[i])
df['cluster'] = kmeans.labels_
cluster_num_plot()
cluster_cat_plot()
create_all_summary(df, ['cluster'], N[i], ['count', 'min', 'mean', 'max'])
create_all_summary(df, ['cluster', 'Gender'], N[i], ['count', 'min', 'mean', 'max'])
KMeans(n_clusters=6, random_state=42)
Age | Annual_Income | Spending_Score | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | min | mean | max | count | min | mean | max | count | min | mean | max | |
cluster | ||||||||||||
0 | 38 | 27 | 32.763158 | 40 | 38 | 69 | 85.210526 | 126 | 38 | 63 | 82.105263 | 97 |
1 | 21 | 19 | 44.142857 | 67 | 21 | 15 | 25.142857 | 39 | 21 | 3 | 19.523810 | 40 |
2 | 34 | 19 | 41.970588 | 59 | 34 | 71 | 86.794118 | 126 | 34 | 1 | 17.264706 | 39 |
3 | 38 | 18 | 27.000000 | 40 | 38 | 39 | 56.657895 | 76 | 38 | 29 | 49.131579 | 61 |
4 | 22 | 18 | 25.272727 | 35 | 22 | 15 | 25.727273 | 39 | 22 | 61 | 79.363636 | 99 |
5 | 45 | 43 | 56.155556 | 70 | 45 | 38 | 53.377778 | 67 | 45 | 35 | 49.088889 | 60 |
Age | Annual_Income | Spending_Score | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | min | mean | max | count | min | mean | max | count | min | mean | max | ||
cluster | Gender | ||||||||||||
0 | Female | 21 | 27 | 32.190476 | 38 | 21 | 70 | 86.047619 | 120 | 21 | 69 | 81.666667 | 95 |
Male | 17 | 27 | 33.470588 | 40 | 17 | 69 | 84.176471 | 126 | 17 | 63 | 82.647059 | 97 | |
1 | Female | 13 | 20 | 41.538462 | 58 | 13 | 16 | 26.538462 | 39 | 13 | 5 | 20.692308 | 40 |
Male | 8 | 19 | 48.375000 | 67 | 8 | 15 | 22.875000 | 33 | 8 | 3 | 17.625000 | 39 | |
2 | Female | 15 | 34 | 44.600000 | 57 | 15 | 73 | 92.333333 | 126 | 15 | 5 | 21.600000 | 39 |
Male | 19 | 19 | 39.894737 | 59 | 19 | 71 | 82.421053 | 113 | 19 | 1 | 13.842105 | 36 | |
3 | Female | 25 | 18 | 27.960000 | 40 | 25 | 39 | 57.360000 | 76 | 25 | 29 | 47.120000 | 61 |
Male | 13 | 18 | 25.153846 | 40 | 13 | 42 | 55.307692 | 67 | 13 | 41 | 53.000000 | 60 | |
4 | Female | 13 | 20 | 25.461538 | 35 | 13 | 16 | 25.692308 | 39 | 13 | 65 | 80.538462 | 99 |
Male | 9 | 18 | 25.000000 | 35 | 9 | 15 | 25.777778 | 38 | 9 | 61 | 77.666667 | 92 | |
5 | Female | 25 | 43 | 54.080000 | 68 | 25 | 38 | 53.240000 | 67 | 25 | 35 | 49.520000 | 59 |
Male | 20 | 47 | 58.750000 | 70 | 20 | 39 | 53.550000 | 63 | 20 | 36 | 48.550000 | 60 |
Not Considering "Gender", as can be seen in the first table -
Considering "Gender", as can be seen in the second table -