Lim, Zhi Sin (2025) Comparative Assessment of Distance Metrics in K-means Clustering for Binary-Cluster Applications. Final Year Project (Bachelor), Tunku Abdul Rahman University of Management and Technology.
Abstract
Clustering, an unsupervised learning method, is widely used to group similar data points without labelled outcomes and is particularly impactful in medical data analysis for identifying diagnostic classes. However, selecting an appropriate distance metric is critical to improving clustering accuracy and efficiency. This study addressed the challenge of clustering the Breast Cancer Wisconsin Diagnosis dataset, whose high-dimensional data has complex feature relationships that may overlap and cause more classification errors. The purpose of the study was to explore the strengths and characteristics of each distance metric when K-means clustering is applied to this dataset. To deal with the complex feature relationships, pairwise feature mapping was used, with thresholding on the correlation coefficient to select potential features instead of using all 30 features in the dataset. Six potential features were selected, forming a total of 15 feature pairs so that the algorithm clusters in two-dimensional space. A comparative analysis was conducted using Randomised and Controlled K-means clustering with six distance metrics: Euclidean, Manhattan, Minkowski, Chebyshev, Mahalanobis and Cosine. Performance was evaluated based on clustering accuracy, computational efficiency, and F1-score. The result reported for each feature pair in Randomised K-means clustering was the one with the highest accuracy and lowest iteration count among the randomly generated results. Both clustering algorithms showed only a small impact on the distance metrics, and the results demonstrated that Controlled K-means was more efficient, taking less time while maintaining competitive performance in terms of F1-score and accuracy. Euclidean and Manhattan produced clusters of similar formation and quality on the dataset. Cosine, Chebyshev and Mahalanobis depend on the pattern of the data distribution because of their specific distance measurements.
Compared with the other metrics, Minkowski distance demonstrated superior performance, achieving the lowest computation time while maintaining a high F1-score for both the Malignant and Benign classes in both clustering algorithms. Although the other distance metrics showed strengths in specific aspects, Minkowski provided the best overall balance. Cosine distance had the lowest performance for this medical dataset based on the selected feature combinations. These findings would contribute to medical data analysis by offering an effective clustering approach and more reliable diagnostic processes.
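As a minimal illustration of the quantities named in the abstract, the sketch below gives textbook definitions of the six distance metrics for two-dimensional points and shows how six features yield 15 feature pairs. It is not the project's code: the feature names `f1` to `f6`, the Minkowski order `p=3`, and the covariance matrix passed to the Mahalanobis function are all assumptions, since the abstract does not state them.

```python
import math
from itertools import combinations


def euclidean(a, b):
    # Straight-line (L2) distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    # Sum of absolute coordinate differences (L1).
    return sum(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, p=3):
    # Generalised Lp distance; p=3 is an assumed order, not taken from the source.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def chebyshev(a, b):
    # Largest absolute coordinate difference (L-infinity).
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    # One minus the cosine of the angle between the two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mahalanobis_2d(a, b, cov):
    # Distance scaled by the inverse of a 2x2 covariance matrix (assumed invertible).
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    inv = ((s22 / det, -s12 / det), (-s21 / det, s11 / det))
    dx, dy = a[0] - b[0], a[1] - b[1]
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)


# Hypothetical placeholder names for the six selected features.
features = ["f1", "f2", "f3", "f4", "f5", "f6"]

# Every unordered pair of the six features: C(6, 2) = 15 two-dimensional spaces.
feature_pairs = list(combinations(features, 2))
```

With an identity covariance matrix, `mahalanobis_2d` reduces to `euclidean`, and `minkowski` with `p=1` or `p=2` reduces to `manhattan` or `euclidean` respectively.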
| Item Type: | Final Year Project |
|---|---|
| Subjects: | Technology > Electrical engineering. Electronics engineering |
| Faculties: | Faculty of Engineering and Technology > Bachelor of Electrical and Electronics Engineering with Honours |
| Depositing User: | Library Staff |
| Date Deposited: | 14 Aug 2025 02:40 |
| Last Modified: | 14 Aug 2025 02:40 |
| URI: | https://eprints.tarc.edu.my/id/eprint/33658 |