# Customer Segmentation with Machine Learning: One Size Doesn’t Fit All

In this article, we will use an unsupervised machine learning method (K-means clustering) to split a customer data set into segments that are homogeneous within each cluster and heterogeneous across clusters.

The data set describes credit customers of a bank, with the attributes Credit Amount, Duration, Age, Purpose, Job, Housing, Gender, Savings Account, and Credit Account.

**Step 1:** Import the data from CSV into a data frame
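A minimal sketch of this step with pandas is shown below. The helper name `load_credit_data` is hypothetical, and the article does not state the CSV file name, so the path is left as a parameter.

```python
import pandas as pd


def load_credit_data(path):
    """Read the credit customer CSV at `path` into a DataFrame."""
    df = pd.read_csv(path)
    # Strip stray whitespace from column names so later lookups are predictable
    df.columns = df.columns.str.strip()
    return df
```

A typical call would be `df = load_credit_data("credit_customers.csv")` followed by `df.head()` to inspect the first rows; the file name here is only a placeholder.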

**Step 2**: Input Features

*Continuous Variables*

Age: age of the customer

Duration: duration for which the customer has taken credit

Credit Amount: value of credit taken by the customer

*Categorical Variables*

Sex: gender of the customer

Job: job profile of the customer

Purpose: purpose for which credit has been taken

Housing: housing status of the customer (own, rented)

**Step 3**: Exploratory Data Analysis

*Univariate Analysis: Credit Amount is right skewed*

*Univariate Analysis: Duration is right skewed*

We will apply a logarithmic transformation to the right-skewed variables to bring them closer to a normal distribution.
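The transformation can be sketched as follows. The column names `Credit amount` and `Duration` and the four sample rows are assumptions made for illustration, not values from the article's data set; the helper `log_transform` is hypothetical as well.

```python
import numpy as np
import pandas as pd


def log_transform(df, columns):
    """Return a copy of df with the natural log applied to the given columns.

    Assumes every value in those columns is strictly positive.
    """
    out = df.copy()
    for col in columns:
        out[col] = np.log(out[col])
    return out


# Made-up sample with a long right tail, for illustration only
sample = pd.DataFrame({
    "Credit amount": [250, 1200, 2300, 18000],
    "Duration": [6, 12, 24, 60],
})
transformed = log_transform(sample, ["Credit amount", "Duration"])

# Compare skewness before and after the transformation
print(sample.skew())
print(transformed.skew())
```

On right-skewed data, the skewness reported after the transformation should be closer to zero than before.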

*Categorical Variables:*

*Look at a combination of Categorical and Numerical Variables*
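One simple way to look at a categorical and a numerical variable together is a grouped summary. The sketch below assumes columns named `Housing` and `Credit amount` and uses made-up rows; `summarize_by_category` is a hypothetical helper.

```python
import pandas as pd


def summarize_by_category(df, category_col, numeric_col):
    """Count, mean and median of a numeric column for each category level."""
    return df.groupby(category_col)[numeric_col].agg(["count", "mean", "median"])


# Made-up rows for illustration only
sample = pd.DataFrame({
    "Housing": ["own", "own", "rent", "rent", "own"],
    "Credit amount": [1000, 3000, 2000, 4000, 5000],
})
summary = summarize_by_category(sample, "Housing", "Credit amount")
print(summary)
```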

**Step 4**: Clustering with the K-means Algorithm (Unsupervised Learning)

The K-means algorithm requires the number of clusters (the value of K) to be specified in advance. Hence we will run the algorithm for K=1 to K=11 and, for each value of K, measure the homogeneity within the generated clusters and the heterogeneity across clusters to decide on the optimal value of K.
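The loop over K can be sketched with scikit-learn as below. Homogeneity is measured here by the within-cluster sum of squares (WSS), which scikit-learn exposes as the `inertia_` attribute of a fitted `KMeans` model. The helper `wss_by_k`, the `n_init` and `random_state` settings, and the synthetic three-blob data are all assumptions for illustration; the article would use its transformed features instead.

```python
import numpy as np
from sklearn.cluster import KMeans


def wss_by_k(X, k_values, random_state=42):
    """Fit K-means for each K and return {K: within-cluster sum of squares}."""
    wss = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        model.fit(X)
        wss[k] = model.inertia_  # sum of squared distances to the nearest centroid
    return wss


# Synthetic stand-in data: three well-separated blobs in three dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 3)) for c in (0.0, 3.0, 6.0)])

wss = wss_by_k(X, range(1, 12))  # K = 1 .. 11
for k, value in wss.items():
    print(k, round(value, 1))
```

Plotting these values against K gives the curve used to look for the bending point.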

As K increases, the number of clusters increases and each cluster becomes more homogeneous, so the within-cluster sum of squares (WSS) decreases. The optimal value of K is the point beyond which a further increase in K does not lead to much decrease in WSS. The graph of WSS against K is known as the “Elbow Curve”, and its bending point (e.g. K=3 or 4 in our case) is known as the “Elbow Point”. We will now measure the “Silhouette” score, which indicates the heterogeneity of the clusters. Based on the combination of WSS and Silhouette score, we will finalize K.

The Silhouette coefficient ranges between -1 and 1, with 1 being the theoretical maximum, indicating that the clusters are well separated. In this case, based on the homogeneity and heterogeneity measures of the generated clusters, we can proceed with either K=3 or K=4 as the number of clusters.
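A sketch of the Silhouette calculation with scikit-learn's `silhouette_score` follows. The score needs at least two clusters, so the loop starts at K=2. The helper `silhouette_by_k` and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(X, k_values, random_state=42):
    """Fit K-means for each K >= 2 and return {K: mean Silhouette coefficient}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


# Synthetic stand-in data: three well-separated blobs in three dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 3)) for c in (0.0, 3.0, 6.0)])

scores = silhouette_by_k(X, range(2, 12))
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real data the highest score is only one input; as described above, it is read together with the elbow in the WSS curve.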

**Step 5**: Cluster Analysis

## Profile of the three clusters

Cluster 0: lower mean credit amount, short duration, older customers

Cluster 1: lower mean credit amount, short duration, young customers

Cluster 2: high mean credit amount, long duration, middle-aged customers
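A profile like this is typically read off the per-cluster means of the original (untransformed) variables. The sketch below assumes a `cluster` column holding the K-means labels; the rows are made up to mimic the pattern described and are not the article's results.

```python
import pandas as pd


def profile_clusters(df, label_col, feature_cols):
    """Mean of each feature within each cluster."""
    return df.groupby(label_col)[feature_cols].mean()


# Made-up rows for illustration only
clustered = pd.DataFrame({
    "Age": [58, 62, 24, 26, 40, 42],
    "Duration": [10, 12, 9, 11, 36, 48],
    "Credit amount": [1500, 1700, 1400, 1600, 7000, 9000],
    "cluster": [0, 0, 1, 1, 2, 2],
})
profile = profile_clusters(clustered, "cluster", ["Age", "Duration", "Credit amount"])
print(profile)
```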

**Step 6**: Visualize Clusters

There is clear separation among the clusters based on (Age and Duration) and (Age and Credit Amount). However, that is not so for (Duration and Credit Amount).
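The three pairwise views can be drawn with matplotlib along the lines below. This is a sketch under assumptions: the helper `plot_cluster_pairs`, the `cluster` label column, and the colour map are illustrative choices, not the article's plotting code.

```python
from itertools import combinations

import matplotlib.pyplot as plt


def plot_cluster_pairs(df, feature_cols, label_col="cluster"):
    """Scatter plot of every pair of features, coloured by cluster label."""
    pairs = list(combinations(feature_cols, 2))
    fig, axes = plt.subplots(1, len(pairs), figsize=(5 * len(pairs), 4), squeeze=False)
    for ax, (x_col, y_col) in zip(axes[0], pairs):
        ax.scatter(df[x_col], df[y_col], c=df[label_col], cmap="viridis", s=15)
        ax.set_xlabel(x_col)
        ax.set_ylabel(y_col)
    fig.tight_layout()
    return fig
```

Calling `plot_cluster_pairs(df, ["Age", "Duration", "Credit amount"])` on a data frame that carries the cluster labels yields one panel each for (Age, Duration), (Age, Credit amount) and (Duration, Credit amount).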