Customer Segmentation with Machine Learning: One size doesn’t fit all

Abhishek Sharma
4 min read · Jan 30, 2021

In this article, we will use an unsupervised machine learning method (K-means clustering) to split a customer data set into segments that are homogeneous within each cluster and heterogeneous across clusters.

The data set describes the credit customers of a bank, with the attributes Credit Amount, Duration, Age, Purpose, Job, Housing, Gender, Savings Account, Credit Account.

Step 1: Import the data from CSV into a data frame
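A minimal sketch of this step with pandas is shown here. The column names and the three rows are made-up placeholders rather than the bank's actual data, and an in-memory string stands in for the CSV file, whose name the article does not give.

```python
import io

import pandas as pd

# Made-up rows standing in for the real CSV file. With the actual file you
# would pass its path to pd.read_csv instead of this in-memory buffer.
csv_text = io.StringIO(
    "Age,Sex,Job,Housing,Credit amount,Duration,Purpose\n"
    "45,male,2,own,1500,6,radio/TV\n"
    "23,female,1,rent,4200,36,car\n"
    "38,male,3,own,9800,48,business\n"
)

df = pd.read_csv(csv_text)

# Quick sanity checks: size of the frame and the inferred column types.
print(df.shape)
print(df.dtypes)
```

Checking the shape and dtypes straight after loading is a cheap way to confirm that numeric columns were not read in as text.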

Step 2: Input Features

Continuous Variables
Age - age of the customer
Duration - duration for which the customer has taken credit
Credit Amount - value of credit taken by the customer

Categorical Variables
Sex - gender of the customer
Job - job profile of the customer
Purpose - purpose for which credit has been taken
Housing - housing status of the customer (own, rented)

Step 3: Exploratory Data Analysis

Univariate Analysis: Credit Amount is right-skewed

Univariate Analysis: Duration is right-skewed

We will apply a logarithmic transformation to the right-skewed variables to bring them closer to a normal distribution
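The transformation can be sketched as follows. The values are invented for illustration and the column names are assumptions; only the two variables described as right-skewed are transformed.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; column names are assumptions.
df = pd.DataFrame({
    "Credit amount": [1200, 2500, 900, 15000, 4300],
    "Duration": [6, 12, 9, 48, 24],
})

skewed_cols = ["Credit amount", "Duration"]

# Both columns are strictly positive, so the natural log is defined.
df_log = np.log(df[skewed_cols])

# Compare sample skewness before and after the transformation.
print(df[skewed_cols].skew())
print(df_log.skew())
```

If a column could contain zeros, `np.log1p` would be a safer choice than `np.log`.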

Categorical Variables:

Next, we look at combinations of categorical and numerical variables

There is some positive correlation between Credit Amount and Duration, and it does not vary with gender
Female customers are observed to be younger than male customers
There is no observable difference between housing categories
The plot above shows that the largest amounts are taken for vacations/others and the smallest for domestic appliances
The shortest credit durations are for domestic appliances
Customers with job type 3 have taken the largest credit amounts

Step 4: Clustering with the K-means Algorithm (Unsupervised Learning)

The K-means algorithm requires the number of clusters (i.e. the value of K) to be specified in advance. Hence we will run the algorithm for K=1 to K=11 and, for each value of K, measure the homogeneity within the generated clusters and the heterogeneity across clusters to decide on the optimal value of K.

As K increases, the number of clusters increases and each cluster becomes more homogeneous, so the within-cluster sum of squares (WSS) decreases. The optimal value of K is the point beyond which a further increase in K doesn’t lead to much decrease in WSS. This graph is also known as the “Elbow Curve”, and the bending point (e.g. K=3 or 4 in our case) is known as the “Elbow Point”. We will now measure the “Silhouette” score, which indicates the heterogeneity of the clusters, i.e. how well separated they are. Based on a combination of WSS and Silhouette score, we will finalize K.
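The WSS values behind an elbow curve can be computed along these lines. This sketch uses scikit-learn's `KMeans`, whose `inertia_` attribute is the within-cluster sum of squares. The data is a synthetic stand-in (three artificial blobs), not the credit data set, so the numbers it prints are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the prepared feature matrix: three blobs in a
# three-feature space. This is NOT the bank data.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 0.0], [0.0, 5.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.7, size=(60, 3)) for c in centers])

# Fit K-means for K = 1..11 and record the within-cluster sum of squares.
wss = {}
for k in range(1, 12):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss[k] = model.inertia_

for k, value in wss.items():
    print(k, round(value, 1))
```

Plotting `wss` against K gives the elbow curve; the elbow is where the drop from one K to the next flattens out.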

The Silhouette coefficient ranges between -1 and 1, with 1 being the theoretical maximum, indicating that the clusters are well separated. In this case, based on the homogeneity and heterogeneity measures of the generated clusters, we can proceed with either K=3 or K=4 as the number of clusters.
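A sketch of the Silhouette comparison, again on synthetic blobs rather than the credit data, could look like this. It uses scikit-learn's `silhouette_score`. The score needs at least two clusters, so the loop starts at K=2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the prepared feature matrix (NOT the bank data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 0.0], [0.0, 5.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.7, size=(60, 3)) for c in centers])

# Mean Silhouette coefficient for each candidate K (undefined for K = 1).
sil = {}
for k in range(2, 12):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    sil[k] = silhouette_score(X, labels)

best_k = max(sil, key=sil.get)
for k, value in sil.items():
    print(k, round(value, 3))
print("K with the highest Silhouette score:", best_k)
```

On real data the highest Silhouette score and the elbow do not always agree, which is why the two measures are read together.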

Add the Cluster Label to the dataset to analyze the generated clusters
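One way to attach the labels is sketched here with K=3. The rows and column names are made up, and standardizing the features with `StandardScaler` before clustering is an assumption of this sketch, not a step described in the article.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up customers for illustration; column names are assumptions.
df = pd.DataFrame({
    "Age": [58, 63, 61, 24, 27, 22, 40, 38, 42],
    "Credit amount": [1500, 1300, 1700, 1400, 1800, 1200, 9000, 7000, 8500],
    "Duration": [10, 12, 9, 12, 10, 8, 40, 44, 36],
})
features = ["Age", "Credit amount", "Duration"]

# Log-transform the two right-skewed variables, then standardize
# (the standardization is an assumption of this sketch).
X = df[features].astype(float)
for col in ["Credit amount", "Duration"]:
    X[col] = np.log(X[col])
X_scaled = StandardScaler().fit_transform(X)

# Fit K-means with K=3 and store each customer's cluster label.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["Cluster"] = kmeans.fit_predict(X_scaled)

print(df["Cluster"].value_counts())
```

The label numbers themselves (0, 1, 2) are arbitrary and can change between runs with different seeds; only the grouping matters.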

Step 5: Cluster Analysis

The profile of the three clusters indicates:

Cluster 0 — lower mean of credit amount, short duration, older customers

Cluster 1 — lower mean of credit amount, short duration, young customers

Cluster 2 — high mean of credit amount, long duration, middle-aged customers
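A profile like this one can be produced by grouping on the cluster label and averaging the numeric features. In this sketch the rows and the `Cluster` column are hypothetical values chosen to mirror the pattern described, not the article's results.

```python
import pandas as pd

# Hypothetical labelled customers; values are invented for illustration.
df = pd.DataFrame({
    "Age": [58, 63, 24, 27, 40, 38],
    "Credit amount": [1500, 1300, 1400, 1800, 9000, 7000],
    "Duration": [10, 12, 12, 10, 40, 44],
    "Cluster": [0, 0, 1, 1, 2, 2],
})
features = ["Age", "Credit amount", "Duration"]

# Mean of each feature per cluster, plus the number of customers in it.
profile = df.groupby("Cluster")[features].mean()
profile["Customers"] = df.groupby("Cluster").size()

print(profile)
```

Including the cluster sizes alongside the means helps to spot segments that are too small to act on.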

Step 6: Visualize Clusters

There is a clear separation among clusters based on (Age and Duration) and (Age and Credit Amount). However, this is not the case for (Duration and Credit Amount).