HOPKINS STATISTICS IN CLUSTERING
ASSESSING CLUSTERING TENDENCY USING HOPKINS STATISTICS
Clustering, in simple terms, means a collection or a group. Similarly, in data science, data can be grouped to arrive at effective solutions for different groups. Take, for example, marketing in a fashion store. Decisions on clothing designs need to be taken for different groups like kids, teens, women, men, etc. Here, the decision cannot be the same for all the groups. For better marketing, the strategies used should be different for each group. In such scenarios, we take the help of clustering.
Clustering is a data analysis tool that partitions data into groups of similar items. There can be two aims of clustering - realistic or constructive. The aim of realistic clustering is to uncover the real groupings that exist in the data, whereas the aim of constructive clustering is to cluster the data whether or not a real grouping is inherent in it. Example - the marketing scenario above is realistic clustering, while clustering students based on their weights is constructive clustering.
The steps involved in clustering are:
- Data Preprocessing - Although clustering is itself a part of Exploratory Data Analysis (EDA), the initial preprocessing steps should still be performed, such as removing null values, handling outliers, and applying principal component analysis.
- Evaluate Clustering Tendency - Check whether the data actually has a tendency to form clusters. If yes, we can proceed with clustering. If no, more data will be required, the data should be reprocessed, or the resulting clusters will not be meaningful.
- Clustering Algorithm - Apply a clustering algorithm to create the optimum number of clusters. (A small Python sketch of this workflow follows the list.)
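As a rough illustration of these three steps, here is a minimal Python sketch. The file customers.csv and the columns age and annual_spend are purely hypothetical, and it assumes pandas and scikit-learn are available:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical customer data with 'age' and 'annual_spend' columns
df = pd.read_csv("customers.csv")

# Step 1: data preprocessing - drop null values, scale, optionally reduce dimensions
X = StandardScaler().fit_transform(df[["age", "annual_spend"]].dropna())
X = PCA(n_components=2).fit_transform(X)

# Step 2: evaluate clustering tendency here (e.g., Hopkins Statistics, shown later)

# Step 3: apply a clustering algorithm with a chosen number of clusters
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```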
One should be very careful while applying clustering, because the algorithm will cluster the data whether or not clustering is actually meaningful. So the most important step in clustering is to evaluate the clustering tendency.
Hopkins Statistics
Hopkins Statistics is a method to evaluate whether a dataset has inherent groups in it or not.
According to Lawson and Jurs (1990), Hopkins Statistics is used to evaluate the clustering tendency of a dataset by finding the probability that the data was generated by a uniform distribution. In simple terms, Hopkins Statistics checks the randomness of the data.
Formula of Hopkins Statistics:
H = ∑ui ∕ ( ∑ui + ∑wi )
where ui and wi are the nearest-neighbor distances defined in the working steps below.
Hypothesis of Hopkins Statistics:
Null Hypothesis: The dataset is uniformly distributed (meaning no meaningful clusters are possible)
Alternative Hypothesis: The dataset is not uniformly distributed (meaning meaningful clusters are possible)
How Does Hopkins Statistics Work?
- From the dataset, software (like Python and R) will randomly sample data points (ie., randomly select existing points)
- Let us say 'n' sample points are collected from the original data
- Then 'n' artificial data points will be generated by the software, uniformly at random within the range of the original data
- The distance from each sample point to its nearest neighbor in the original dataset (excluding the point itself) is calculated (d = ui)
- Then the distance from each artificial point to its nearest neighbor in the original dataset is calculated (d = wi)
- These values are plugged into the Hopkins Statistics formula to obtain the Hopkins index (a small Python sketch after this list shows the full procedure)
- If H > 0.5, the data has no inherent groups (ie., a meaningful clustering is not possible)
- If H < 0.5, a meaningful clustering is possible.
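To make the procedure concrete, here is a minimal Python sketch. The function name hopkins and the sample_ratio parameter are illustrative choices rather than part of any standard library; it assumes numpy and scikit-learn are installed and follows the convention used above, where ui comes from sampled real points and wi from artificial points, so values below 0.5 suggest clusterable data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, sample_ratio=0.5, random_state=42):
    """Hopkins index with the convention used in this blog:
    ui = nearest-neighbor distances of sampled real points (excluding themselves),
    wi = nearest-neighbor distances of uniformly generated artificial points.
    H < 0.5 suggests clusterable data; H around 0.5 suggests random data."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    n = max(1, int(sample_ratio * len(X)))

    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # ui: sample n real points; take the 2nd neighbor to skip the point itself
    sample = X[rng.choice(len(X), size=n, replace=False)]
    u = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]

    # wi: n artificial points drawn uniformly inside the bounding box of X
    artificial = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n, X.shape[1]))
    w = nn.kneighbors(artificial, n_neighbors=1)[0][:, 0]

    return u.sum() / (u.sum() + w.sum())

# Toy dataset from the demo below
D = [(2, 2), (3, 3), (4, 3), (7, 5), (7, 4), (8, 4)]
print(hopkins(D))  # typically prints a value below 0.5, indicating clusterable data
```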
Demo of How Hopkins Statistics Works:
1. Say we have a dataset D = [(2,2), (3,3), (4,3), (7,5), (7,4), (8,4)]
(Figure: scatter plot of dataset D)
2. A sample is taken from dataset D - [(3,3), (7,5), (8,4)]
3. The distance from each sample point to its nearest neighbor in the original dataset D (excluding the point itself) is calculated.
4. As mentioned earlier, the software generates artificial data points randomly. Since we have taken three sample points, there should be three artificial data points.
Here, we will take [(3,2), (3,5), (7,2)]
5. The distance from each artificial point to its nearest neighbor in the original dataset D is calculated.
6. Now the Hopkins Statistics formula is applied, as worked out below.
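Plugging in the numbers (using Euclidean distances, and excluding each sample point itself when looking for its nearest neighbor):
∑ui = 1 + 1 + 1 = 3 (the nearest neighbors of (3,3), (7,5) and (8,4) are (4,3), (7,4) and (7,4) respectively)
∑wi = 1 + 2 + 2 = 5 (the nearest neighbors of (3,2), (3,5) and (7,2) are (2,2), (3,3) and (7,4) respectively)
H = 3 ∕ (3 + 5) = 0.375
Since H < 0.5, dataset D has a clustering tendency, which matches the two visible groups in the data.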
Summary
In this blog, the working of Hopkins Statistics was described. If the H index is below 0.5, showing that the data is clusterable, the next step is to find the optimum number of clusters and perform the clustering.
References
1. Adolfsson, A., Ackerman, M., & Brownstein, N. C. (2019). To cluster, or not to cluster: An analysis of clusterability methods. Pattern Recognition, 88, 13-26.
2. https://sushildeore99.medium.com/really-what-is-hopkins-statistic-bad1265df4b
Hope this blog was helpful!
Remember - "Every learner was once a beginner"
Remember - "Every learner was once a beginner''
- Tejaswini