Cluster Analysis with Tableau
Cluster analysis is a technique for grouping a set of objects by the similarity of their data points and for identifying patterns within those groups.
In the literature, it is also referred to as “pattern recognition” or “unsupervised machine learning”: “unsupervised” because we are not guided by a prior idea of which variables or samples belong in which clusters, and “learning” because the algorithm “learns” how to cluster.
Cluster analysis is popular in many fields, including:
• In Cancer research for classifying patients into subgroups according to their gene expression profile. This can be useful for identifying the molecular profile of patients with a good or bad prognosis, as well as for understanding the disease.
• In Marketing for market segmentation by identifying subgroups of customers with similar profiles and who might be receptive to a particular form of advertising.
• In City-planning for identifying groups of houses according to their type, value and location.
​
Tableau has a built-in clustering feature that works on the data selected in the view. I will cover the definitions and statistical calculations behind clustering in more detail later, while working with RStudio.
​
Let's perform clustering on our Spend analysis to identify purchasing patterns of Materials and cost-saving opportunities in our dataset (continuing from Tableau Advance visuals).
To summarize our observations on the Procurement dataset so far:
There is high spending from Aug’18 to Apr’19 by Company Codes 7860 & 9000 on Material Types YROH & ZROH, which are further sub-divided into 3 high-cost Material Groups: 1500, 2003 & 2211. Material Group 2003 contains only Materials measured in Tons, while Material Groups 1500 & 2211 contain high-cost Materials measured in Kilograms.
​
Let's create a subset of the data using the selections above; that will be our dataset for clustering in Tableau.
For the cluster analysis, we will use two fields: 1. PO Quantity and 2. Par Cost (calculated field - [Gross value]/[Par cost]).
Click to See Procurement Tracking Dashboard
1. Cluster Analysis on Scatterplot:
Applied standard data filters:
• Focused Period: Aug'18 to Apr'19
• CoCd: 7860 & 9000
• M Type: ZROH & YROH
• Matl Group: 1500, 2003 & 2211
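For readers who want to reproduce the same subset outside Tableau, the filters above can be sketched in plain Python. This is only an illustration: the rows, the field names (po_date, cocd, m_type, matl_group, po_qty) and the material type "OTHER" are made up for the example and are not taken from the Procurement dataset.

```python
from datetime import date

# Hypothetical purchase-order rows; field names and values are illustrative only.
orders = [
    {"po_date": date(2018, 9, 14), "cocd": "7860", "m_type": "ZROH", "matl_group": "2003", "po_qty": 120.0},
    {"po_date": date(2018, 6, 2), "cocd": "7860", "m_type": "ZROH", "matl_group": "2003", "po_qty": 80.0},    # before the period
    {"po_date": date(2019, 1, 20), "cocd": "9000", "m_type": "YROH", "matl_group": "1500", "po_qty": 500.0},
    {"po_date": date(2019, 2, 11), "cocd": "1000", "m_type": "YROH", "matl_group": "2211", "po_qty": 300.0},  # other company code
    {"po_date": date(2019, 4, 30), "cocd": "9000", "m_type": "ZROH", "matl_group": "2211", "po_qty": 50.0},
    {"po_date": date(2019, 3, 5), "cocd": "9000", "m_type": "OTHER", "matl_group": "1500", "po_qty": 10.0},   # other material type
]

# The four filters listed above (period bounds are inclusive).
PERIOD_START = date(2018, 8, 1)
PERIOD_END = date(2019, 4, 30)
COMPANY_CODES = {"7860", "9000"}
MATERIAL_TYPES = {"ZROH", "YROH"}
MATERIAL_GROUPS = {"1500", "2003", "2211"}


def in_scope(row):
    """Return True when a row passes every filter."""
    return (
        PERIOD_START <= row["po_date"] <= PERIOD_END
        and row["cocd"] in COMPANY_CODES
        and row["m_type"] in MATERIAL_TYPES
        and row["matl_group"] in MATERIAL_GROUPS
    )


subset = [row for row in orders if in_scope(row)]
print(len(subset), "of", len(orders), "rows remain after filtering")
```

Keeping each filter as a named set makes it easy to widen or narrow the selection later, much like editing a filter card in Tableau.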
​
In the Analytics pane at the top left of the page, under the Model section, drag 'Cluster' and drop it onto the worksheet.
Tableau will create what it considers an optimal number of clusters based on the fields placed on Columns & Rows.
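To get a feel for what happens behind the drag and drop, here is a minimal k-means sketch in plain Python. Tableau's help pages describe its clustering as k-means based, but this is not Tableau's implementation: the min-max scaling step, the choice of the first k points as starting centres, and the six (PO quantity, par cost) pairs are all assumptions made for the illustration.

```python
import math


def min_max_scale(points):
    """Rescale each dimension to the 0..1 range so neither measure dominates."""
    dims = list(zip(*points))
    lows = [min(d) for d in dims]
    spans = [(max(d) - min(d)) or 1.0 for d in dims]
    return [tuple((v - lo) / s for v, lo, s in zip(p, lows, spans)) for p in points]


def k_means(points, k, iterations=100):
    """Lloyd's algorithm; the first k points are used as the initial centres (an assumption)."""
    centres = [points[i] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centres.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centres.append(centres[c])  # keep an empty cluster's centre unchanged
        if new_centres == centres:
            break  # converged: nothing moved
        centres = new_centres
    return labels, centres


# Hypothetical (PO quantity, par cost) pairs, not taken from the Procurement dataset.
raw_points = [(10, 900), (400, 120), (12, 950), (15, 880), (420, 110), (390, 130)]
labels, centres = k_means(min_max_scale(raw_points), k=2)
print(labels)
```

Scaling first matters here because PO quantity and par cost are on very different ranges; without it, the measure with the larger numbers would dominate the distance calculation.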
(Try editing the number of clusters and observe how the results change. You can stop adding clusters once the results no longer change much. I will explain the statistics behind this while working in RStudio.)
​
Scatter plots usually show the relationship between two continuous variables.
​
According to Tableau, the optimal number of clusters for our selection is 2.
The statistical description of our clusters is shown below:
​
We can change the number of clusters based on what we observe. Note: once adding another cluster no longer changes the results much, we have reached a suitable number of clusters and no further clusters are required.
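One way to make "not much change" measurable is a cluster-quality score. Tableau's help pages mention the Calinski-Harabasz criterion for this purpose; the sketch below computes that score for two hand-written groupings of the same toy points. The points and both labelings are hypothetical, and the function is a textbook version of the formula rather than Tableau's code.

```python
def centroid(points):
    """Mean position of a list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def calinski_harabasz(points, labels):
    """(between-cluster dispersion / (k - 1)) / (within-cluster dispersion / (n - k))."""
    n = len(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    overall = centroid(points)
    between = sum(len(m) * sq_dist(centroid(m), overall) for m in clusters.values())
    within = sum(sq_dist(p, centroid(m)) for m in clusters.values() for p in m)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical, already-scaled points forming two tight groups.
points = [(0.0, 0.9), (0.1, 1.0), (0.0, 1.0), (0.9, 0.0), (1.0, 0.1), (1.0, 0.0)]
two_clusters = [0, 0, 0, 1, 1, 1]    # the natural split
three_clusters = [0, 0, 2, 1, 1, 1]  # same split with one tight group broken up

score_two = calinski_harabasz(points, two_clusters)
score_three = calinski_harabasz(points, three_clusters)
print(round(score_two, 1), round(score_three, 1))
```

A higher score means tighter, better separated clusters. On this toy data the two-cluster grouping scores higher than the three-cluster one, which is the kind of signal that says an extra cluster is not adding much.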
​
This will be discussed in detail while working with RStudio later.
Looking at the material-wise Par cost trend of both clusters across the focused period, we can see that Cluster 1 shows massive volatility and high cost across Material Groups 1500, 2003 & 2211. Cluster 2 seems to have data points only for Jan'19 & Feb'19, so it is comparatively less volatile.
To identify where our company is spending the most and whether there is any cost-saving opportunity, let's analyse Material Group 2003 in Cluster 1, which seems to have a high and highly fluctuating cost.
From our Procurement Tracking Dashboard, we can see that Material Group 2003 has orders for Soya bean only, in Tons.
On the other hand, Material Groups 1500 & 2211 seem to contain a variety of materials with low par cost.
The trend line plot below shows each material code listed in Material Group 2003. The thicker the trend line, the higher the PO quantity.
Observations:
1. Material 910975 has the highest cost per unit and hence was ordered only once.
2. Material 910000 is more cost-effective in the long run.
3. The costs of all other materials are higher, but they are still preferred by organisations.
4. Material code 910973 has a cost benefit when ordered in high quantities.
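Observations like these can be cross-checked with a small per-material summary of order count and quantity-weighted average unit cost. The sketch below is a hedged illustration only: the material codes MAT-A, MAT-B and MAT-C and all quantities and costs are invented placeholders, not figures from the Procurement dataset.

```python
from collections import defaultdict

# Hypothetical order lines: (material, PO quantity in tons, cost per ton).
order_lines = [
    ("MAT-A", 5.0, 900.0),
    ("MAT-B", 40.0, 410.0),
    ("MAT-B", 60.0, 400.0),
    ("MAT-C", 10.0, 520.0),
    ("MAT-C", 90.0, 430.0),
]


def summarise(lines):
    """Return order count and quantity-weighted average unit cost per material."""
    totals = defaultdict(lambda: {"orders": 0, "qty": 0.0, "spend": 0.0})
    for material, qty, unit_cost in lines:
        t = totals[material]
        t["orders"] += 1
        t["qty"] += qty
        t["spend"] += qty * unit_cost
    return {
        material: {"orders": t["orders"], "avg_unit_cost": t["spend"] / t["qty"]}
        for material, t in totals.items()
    }


summary = summarise(order_lines)
for material, stats in sorted(summary.items()):
    print(material, stats["orders"], round(stats["avg_unit_cost"], 2))
```

Weighting by quantity keeps one small, expensive order from distorting a material's average, which matters when comparing a material ordered once against one ordered regularly in bulk.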
​
We can also run another cluster analysis for Material Groups 1500 & 2211 to identify more such cost-saving opportunities.
2. Animated Visuals with Tableau:
Another amazing feature of doing EDA with Tableau is that we can also animate our visualisations. This makes it easy to study how cost progresses over time for each respective material.
Trend : Cluster 1 - 1500 & 2211
Trend : Cluster 1 - 2003
If you have any queries or further suggestions, please mention them in the comments section below or in the chatbox. Thank you.