Customer Segmentation with Machine Learning: One Size Doesn’t Fit All
In this article, we will use an unsupervised machine learning method (K-means clustering) to split a customer data set into segments that are homogeneous within each cluster and heterogeneous across clusters.
The data set describes credit customers of a bank, with the attributes Credit Amount, Duration, Age, Purpose, Job, Housing, Gender, Savings Account, and Credit Account.
Step 1: Import the data from a CSV file into a data frame
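A minimal sketch of this step with pandas is shown below. The article does not give the file name or the exact column headers, so the inline sample, its column names, and its rows are made-up placeholders used only to keep the example runnable.

```python
import io

import pandas as pd

# Made-up stand-in for the real CSV file; the column names and rows
# are illustrative assumptions, not the bank's actual data.
sample_csv = io.StringIO(
    "Age,Sex,Job,Housing,Credit amount,Duration,Purpose\n"
    "45,male,2,own,1500,12,car\n"
    "23,female,1,rent,6200,36,education\n"
    "38,male,2,own,2400,18,furniture\n"
)

# With a real file this would be pd.read_csv("<path to your CSV>").
df = pd.read_csv(sample_csv)

print(df.shape)   # number of rows and columns
print(df.dtypes)  # which columns are numerical and which are categorical
```

Checking the shape and the dtypes right after loading is a quick way to confirm that numerical columns were not read in as text.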
Step 2: Input Features
Age: age of the customer
Duration: duration for which the customer has taken credit
Credit Amount: value of the credit taken by the customer
Sex: gender of the customer
Job: job profile of the customer
Purpose: purpose for which the credit has been taken
Housing: housing status of the customer (own, rented)
Step 3: Exploratory Data Analysis
Univariate Analysis: Credit Amount is right-skewed
Univariate Analysis: Duration is right-skewed
We will apply a logarithmic transformation to the right-skewed variables to bring them closer to a normal distribution
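The effect of the log transformation can be checked by comparing skewness before and after. The sketch below uses synthetic log-normal data as a stand-in for Credit Amount and Duration; the column names and distribution parameters are assumptions, not values from the bank data.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in data (log-normal draws); not the bank data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Credit amount": rng.lognormal(mean=8.0, sigma=0.8, size=500),
    "Duration": rng.lognormal(mean=3.0, sigma=0.5, size=500),
})

# Sample skewness before the transform: positive for right-skewed data.
skew_before = df.skew()

# Natural log; valid here because both columns are strictly positive.
df_log = np.log(df)
skew_after = df_log.skew()

print(pd.DataFrame({"before": skew_before, "after": skew_after}))
```

If a column could contain zeros, `np.log1p` would be a safer choice than `np.log`.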
Bivariate Analysis: we then look at combinations of categorical and numerical variables
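One simple way to combine a categorical and a numerical variable is to summarize the numerical one per category. The sketch below does this for Housing against Credit Amount and Duration; the rows are made up for illustration.

```python
import pandas as pd

# Small made-up sample; values are illustrative only.
df = pd.DataFrame({
    "Housing": ["own", "rent", "own", "rent", "own"],
    "Credit amount": [1200, 5900, 2100, 4800, 9000],
    "Duration": [6, 48, 12, 24, 36],
})

# Mean and median of each numerical column within each housing category.
summary = df.groupby("Housing")[["Credit amount", "Duration"]].agg(["mean", "median"])

print(summary)
```

The same pattern works for Sex, Job, or Purpose by changing the grouping column.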
Step 4: Clustering with the K-means Algorithm (Unsupervised Learning)
The K-means algorithm requires the number of clusters (the value of K) to be specified in advance. Hence we will run the algorithm for K=1 to K=11 and, for each value of K, measure the homogeneity within the generated clusters and the heterogeneity across them to decide on the optimal value of K.
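The loop over K can be sketched with scikit-learn as follows. The data is synthetic, and standardizing the features before clustering is an assumption on my part, since the article does not say how the features were scaled. The within-cluster sum of squares is read from the fitted model's `inertia_` attribute.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for Age, Credit Amount, and Duration; not the bank data.
rng = np.random.default_rng(42)
X_raw = np.column_stack([
    rng.integers(19, 75, size=300),     # Age
    rng.lognormal(8.0, 0.8, size=300),  # Credit Amount (right-skewed)
    rng.lognormal(3.0, 0.5, size=300),  # Duration (right-skewed)
])

# Log-transform the skewed columns, then standardize (scaling is an assumption).
X = X_raw.astype(float)
X[:, 1:] = np.log(X[:, 1:])
X = StandardScaler().fit_transform(X)

# Fit K-means for K = 1 .. 11 and record the within-cluster sum of squares.
wss = {}
for k in range(1, 12):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss[k] = model.inertia_

for k, value in wss.items():
    print(k, round(value, 1))
```

Plotting `wss` against K gives the curve used to look for the bend.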
As K increases, the number of clusters increases and each cluster becomes more homogeneous, so the within-cluster sum of squares (WSS) falls. The optimal value of K is the point beyond which a further increase in K no longer leads to much of a decrease in WSS. The plot of WSS against K is known as the “Elbow Curve”, and its bending point (e.g. K=3 or 4 in our case) is known as the “Elbow Point”. We will now measure the Silhouette score, which indicates the heterogeneity of the clusters. Based on the combination of WSS and Silhouette score, we will finalize K.
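A sketch of the Silhouette measurement with scikit-learn is shown below. It starts at K=2 because the score is not defined for a single cluster. The data comes from `make_blobs` and is purely synthetic, so the K it favours says nothing about the bank data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three features; a stand-in for the prepared customer features.
X, _ = make_blobs(n_samples=300, centers=3, n_features=3, random_state=0)

# Silhouette score for K = 2 .. 11 (it needs at least two clusters).
sil = {}
for k in range(2, 12):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil[k] = silhouette_score(X, labels)

# K with the highest average Silhouette score on this synthetic data.
best_k = max(sil, key=sil.get)

for k, value in sil.items():
    print(k, round(value, 3))
print("best K by Silhouette:", best_k)
```

In practice the highest-scoring K is weighed against the elbow in the WSS curve rather than chosen automatically.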
The Silhouette coefficient ranges between -1 and 1, with 1 as the theoretical maximum, indicating that the clusters are well separated. In this case, based on the homogeneity and heterogeneity measures of the generated clusters, we can proceed with either K=3 or K=4 as the number of clusters.
Step 5: Cluster Analysis
The profile of the three clusters (K=3) indicates:
Cluster 0: lower mean credit amount, short duration, older customers
Cluster 1: lower mean credit amount, short duration, young customers
Cluster 2: high mean credit amount, long duration, middle-aged customers
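A profile like the one above can be produced by fitting K-means with K=3, attaching the labels to the data frame, and averaging each feature per cluster. The sketch below assumes log-transformed, standardized inputs and uses synthetic data, so its cluster numbering and means will not match the profile described in the article.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the bank data.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Age": rng.integers(19, 75, size=300),
    "Credit amount": rng.lognormal(8.0, 0.8, size=300),
    "Duration": rng.lognormal(3.0, 0.5, size=300),
})

# Prepare the model inputs: log-transform skewed columns, then standardize
# (both preprocessing choices are assumptions for this sketch).
features = df.copy()
skewed = ["Credit amount", "Duration"]
features[skewed] = np.log(features[skewed])
X = StandardScaler().fit_transform(features)

# Fit the final model and label every customer with its cluster.
df["Cluster"] = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# Profile on the original scale so the means are easy to interpret.
profile = df.groupby("Cluster").mean().round(1)
profile["Customers"] = df["Cluster"].value_counts().sort_index()

print(profile)
```

Computing the means on the untransformed columns keeps the profile in the units a business reader expects.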
Step 6: Visualize Clusters
There is clear separation among the clusters on (Age and Duration) and (Age and Credit Amount). However, that is not the case for (Duration and Credit Amount).
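Pairwise scatter plots colored by cluster label are one way to inspect this separation. The helper `plot_cluster_pairs` below is a hypothetical name introduced for this sketch. The data and the cluster labels are random placeholders, so this example will not reproduce the separation described above.

```python
import itertools

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_cluster_pairs(df, features, label_col="Cluster"):
    """Draw one scatter plot per pair of features, colored by cluster label."""
    pairs = list(itertools.combinations(features, 2))
    fig, axes = plt.subplots(1, len(pairs), figsize=(5 * len(pairs), 4))
    for ax, (x_name, y_name) in zip(np.atleast_1d(axes), pairs):
        ax.scatter(df[x_name], df[y_name], c=df[label_col], cmap="viridis", s=15)
        ax.set_xlabel(x_name)
        ax.set_ylabel(y_name)
    fig.tight_layout()
    return fig


# Synthetic data with placeholder labels; in practice the labels come from K-means.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "Age": rng.integers(19, 75, size=200),
    "Duration": rng.lognormal(3.0, 0.5, size=200),
    "Credit amount": rng.lognormal(8.0, 0.8, size=200),
    "Cluster": rng.integers(0, 3, size=200),
})

fig = plot_cluster_pairs(df, ["Age", "Duration", "Credit amount"])
```

Calling `plt.show()` or `fig.savefig(...)` afterwards displays or stores the three panels.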