The comparison of k-means and k-medoids algorithms for clustering the spread of the covid-19 outbreak in Indonesia

Utomo, ILKOM Jurnal Ilmiah Vol. 13 No. 1, April 2021, pp. 31-35, E-ISSN 2548-7779

The coronavirus spreads quickly from person to person through close contact and respiratory droplets produced by coughing or sneezing. Various studies have been carried out to deal with Covid-19, but no cure for the virus has been found so far. Based on data from the covid19.go.id page retrieved on January 1st, 2021, and updated by the Ministry of Health, there were 1,078,314 confirmed cases overall: 175,095 active cases (16.2% of confirmed cases), 873,221 recoveries (81.0%), and 29,998 deaths (2.8%). This study compares two clustering algorithms to analyze clustering patterns and determine the better data processing method. The data, sourced from the Ministry of Health, contain 4 attributes: confirmed cases, cases under treatment, recoveries, and deaths. Only 2 attributes are used in this study: confirmed cases and deaths. Comparing the K-Means and K-Medoids methods on Davies-Bouldin index values for k = 2 to k = 9, K-Means reaches its smallest value, 0.064, at k = 5, while K-Medoids reaches its smallest value, 0.411, at k = 2. It can therefore be concluded that, of the two methods, K-Means is the better one for clustering the spread of the coronavirus outbreak in Indonesia.

This study directly compares the two algorithms on the same dataset to analyze clustering patterns and determine the better algorithm for processing the data. The data, sourced from the Ministry of Health, have 4 attributes: confirmed, treatment, recovery, and death cases. However, only 2 attributes are used in this study, namely confirmed and death cases.
The tests were run in RapidMiner version 9.8, and the Davies-Bouldin index was used as the reference for evaluating the clustering. The resulting clustering patterns can then be analyzed to yield new information that helps decision-makers (stakeholders) reduce the spread of the coronavirus and minimize the number of patients confirmed positive for Covid-19. In addition, validity testing applied to K-Means and K-Medoids identifies the better of the two algorithms.

Method
The study clusters the spread of the coronavirus in Indonesia with the K-Means and K-Medoids algorithms in several stages, as shown in Figure 1.

Figure 1. Research Methodology

A. Data Mining
Data mining is the process of finding patterns and relationships in data and extracting added value, in the form of essential information and knowledge, so that the data are simplified and the information becomes understandable and useful. It draws on statistics, mathematics, artificial intelligence, and machine learning [7]. Clustering is the task of dividing data (vectors) into several clusters according to their characteristics. Data with similar characteristics are placed in the same group, and data with different characteristics are placed in different groups. No label is required on the data being processed, because labels can be assigned after the clusters are formed. Because it does not use class labels, clustering is often called unsupervised learning [8].

B. K-Means Algorithm
The K-Means algorithm clusters data by separating them into groups so that similar data fall in the same group and dissimilar data fall in different groups [9].
The K-Means algorithm proceeds in four stages. First, determine the number of groups (k) and choose the group centers randomly. Second, compute the distance from each data point to each cluster center. Third, assign each data point to the group whose center is nearest. Fourth, compute the new group centers, then repeat the second to fourth steps until no data point moves to another group [10]. The distance is computed with the Euclidean distance, formula (1), and the new centers with formula (2) [11].

$$d(i,j) = \sqrt{(X_{1i} - X_{1j})^2 + (X_{2i} - X_{2j})^2 + \cdots + (X_{ki} - X_{kj})^2} \qquad (1)$$

Notes:
d(i,j) = the distance between data point i and cluster center j
X_{ki} = the value of data point i on attribute k
X_{kj} = the value of center j on attribute k

$$C = \frac{\sum m}{n} \qquad (2)$$

where C is the cluster center, m is a data member belonging to a particular centroid, and n is the number of data points belonging to that centroid.
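The four stages above can be sketched in a few lines of Python. This is a minimal illustration, not the RapidMiner implementation used in the study; the function names, the `seed` and `max_iter` parameters, and the toy points are assumptions introduced here.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points, as in formula (1)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, seed=0, max_iter=100):
    """Cluster points into k groups; returns (centers, labels)."""
    rng = random.Random(seed)
    # Stage 1: pick k distinct data points as the initial centers.
    centers = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Stages 2-3: assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        # Stop when no point moved to another group.
        if new_labels == labels:
            break
        labels = new_labels
        # Stage 4: recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Toy 2-D points (illustrative only, not the study's data).
toy = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = k_means(toy, k=2)
print(labels)
```

On this toy input the first three points end up in one group and the last three in the other; which group receives label 0 depends on the random initial centers.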

C. K-Medoids Algorithm
The K-Medoids algorithm belongs to the partitioning family of clustering methods. It is quite efficient on small datasets [12]. It first determines the most representative points (medoids) of the data by computing within-cluster distances over the candidate medoids, so that distances between points within a group are small while distances between groups are large [13]. A medoid is an actual object that serves as the reference point of its group, rather than the mean of the objects in the group. The algorithm takes as input the parameter k, the number of groups into which the n objects are partitioned [14]-[15].
The steps of the K-Medoids algorithm are as follows [16]-[17]:
1. Initialize k cluster centers (medoids), where k is the number of clusters.
2. Allocate each data object to the closest cluster using the Euclidean distance of formula (1).
3. Randomly select an object in each cluster as a candidate for the new medoid.
4. Compute the distance from each object in each cluster to the new medoid candidate.
5. Compute the total deviation (S) as the new total distance minus the old total distance. If S < 0, swap the candidate with the current medoid to form a new set of k objects as medoids.
6. Repeat steps 3 to 5 until the medoids no longer change, which yields the clusters and the members of each cluster.
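The swap rule in steps 3 to 5 can be sketched as follows. This is a simplified variant under stated assumptions: instead of drawing candidates at random, it tries every non-medoid object as a candidate, which makes the stopping condition of step 6 deterministic. The function names, the `seed` parameter, and the toy points are hypothetical and not taken from the study.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points, as in formula (1)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(euclidean(p, points[m]) for m in medoids) for p in points)


def k_medoids(points, k, seed=0):
    """Return (medoid indices, labels) after greedy swaps with S < 0."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)  # step 1
    cost = total_cost(points, medoids)
    improved = True
    while improved:  # step 6: stop once no swap lowers the cost
        improved = False
        for pos in range(k):
            for cand in range(len(points)):  # steps 3-4 (exhaustive, not random)
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                new_cost = total_cost(points, trial)
                s = new_cost - cost  # step 5: total deviation S
                if s < 0:
                    medoids, cost, improved = trial, new_cost, True
    # Step 2, applied to the final medoids: nearest-medoid assignment.
    labels = [
        min(range(k), key=lambda j: euclidean(p, points[medoids[j]]))
        for p in points
    ]
    return medoids, labels


# Toy 2-D points (illustrative only, not the study's data).
toy = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
medoids, labels = k_medoids(toy, k=2)
print(medoids, labels)
```

Because every accepted swap strictly lowers the total distance, the loop terminates. Unlike the K-Means centers, the returned medoids are always indices of actual data points.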

D. Davies-Bouldin Index (DBI)
The Davies-Bouldin Index (DBI), introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating clustering algorithms. It is an internal evaluation scheme, in which the quality of the clustering is validated using quantities and features inherent to the dataset [18]. Separation is measured by the distance between cluster center points. A good clustering under the DBI maximizes the distance between clusters C_i and C_j while minimizing the distance between points within a cluster. When the distance between clusters is at its maximum, the clusters are clearly distinguished from one another, so the differences between them are more evident. When the distance between points within a cluster is minimal, the objects in that cluster share highly similar characteristics. Formula (3) is used to compute the Davies-Bouldin Index [19]:

$$DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} R_{i,j} \qquad (3)$$

where k is the number of clusters used and R_{i,j} is the ratio of the summed within-cluster scatter of clusters i and j to the distance between their centers. The smaller the Davies-Bouldin Index value (it is non-negative, >= 0), the better the resulting clustering [10].
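As an illustration, the index can be computed directly from a set of points and their cluster labels. The sketch below assumes the standard definition, with scatter taken as the mean Euclidean distance of a cluster's members to its centroid; RapidMiner's implementation may differ in detail, and the function names and toy points are assumptions.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(members) for c in zip(*members))


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a labelling; lower is better.

    Assumes at least two clusters with distinct centroids.
    """
    ks = sorted(set(labels))
    groups = {j: [p for p, lab in zip(points, labels) if lab == j] for j in ks}
    cents = {j: centroid(groups[j]) for j in ks}
    # Scatter: mean distance of a cluster's members to its centroid.
    scatter = {
        j: sum(euclidean(p, cents[j]) for p in groups[j]) / len(groups[j])
        for j in ks
    }
    total = 0.0
    for i in ks:
        # Worst-case ratio of combined scatter to centroid separation.
        total += max(
            (scatter[i] + scatter[j]) / euclidean(cents[i], cents[j])
            for j in ks
            if j != i
        )
    return total / len(ks)


# Toy points (illustrative only): two tight, well-separated pairs.
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = davies_bouldin(pts, [0, 0, 1, 1])
bad = davies_bouldin(pts, [0, 1, 0, 1])
print(good)  # 0.2
print(bad)   # 5.0
```

The labelling that groups nearby points scores 0.2, while the one that mixes distant points scores 5.0, which matches the rule that a smaller index indicates a better clustering.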

A. Steps in Data Pre-processing
The data in this study were sourced from the covid19.go.id website. They were retrieved on January 1st, 2021, and were updated by the Ministry of Health on January 31st, 2021. The data are pre-processed before the clustering process is run on the selected attributes. The data cover 34 provinces and 4 attributes: deaths, recoveries, cases in care or independent isolation, and confirmed cases. Of these, 2 attributes are chosen as the reference for the clustering process, namely confirmed and death cases, as shown in Table 1.
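The attribute selection described above amounts to keeping two of the four columns for each province. A minimal sketch follows; the field names and the two rows are placeholders invented for illustration, since the study's actual column headers and Table 1 values are not reproduced here.

```python
# Hypothetical field names; the study's actual column headers are not given.
SELECTED = ("confirmed", "deaths")


def select_attributes(rows, selected=SELECTED):
    """Return (province names, feature vectors) keeping only `selected`."""
    names, vectors = [], []
    for row in rows:
        missing = [a for a in selected if a not in row]
        if missing:
            raise KeyError(f"{row.get('province')}: missing {missing}")
        names.append(row["province"])
        vectors.append(tuple(float(row[a]) for a in selected))
    return names, vectors


# Placeholder rows with made-up numbers, NOT the Ministry of Health figures.
rows = [
    {"province": "Province A", "confirmed": 1200, "in_care": 200,
     "recovered": 950, "deaths": 50},
    {"province": "Province B", "confirmed": 300, "in_care": 40,
     "recovered": 250, "deaths": 10},
]
names, vectors = select_attributes(rows)
print(names)    # ['Province A', 'Province B']
print(vectors)  # [(1200.0, 50.0), (300.0, 10.0)]
```

Each resulting vector is a two-dimensional point (confirmed, deaths) that a clustering routine can consume directly.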

B. Steps of Data Processing
The pre-processed data from the previous stage are then processed with RapidMiner version 9.8. The purpose of this research is to determine which of the K-Means and K-Medoids algorithms produces the better clusters for data on the spread of Covid-19 in Indonesia, for up to 10 clusters. Both algorithms are run on the data, and the best number of clusters is then determined from the DBI values reported by RapidMiner. The DBI value is computed for each algorithm at each number of clusters, with tests carried out from k = 2 to k = 10. The results of the DBI comparison are shown in Figure 2.
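The test procedure, scoring each k from 2 to 10 and keeping the k with the smallest index, can be sketched generically. The helper `sweep_k` and both placeholder callables are hypothetical; in a real run they would be replaced by a K-Means or K-Medoids routine and a Davies-Bouldin scorer.

```python
def sweep_k(points, cluster_fn, score_fn, k_values=range(2, 11)):
    """Score each k and return (best_k, {k: score}); lower score is better."""
    scores = {}
    for k in k_values:
        labels = cluster_fn(points, k)
        scores[k] = score_fn(points, labels)
    # On ties, the smallest k with the minimum score is returned.
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Placeholder callables so the sketch runs on its own. They are NOT real
# clustering or validity measures.
def round_robin(points, k):
    return [i % k for i in range(len(points))]


def distance_from_three_groups(points, labels):
    return abs(len(set(labels)) - 3)


best_k, scores = sweep_k(list(range(20)), round_robin, distance_from_three_groups)
print(best_k)  # 3
```

Running the same sweep once per algorithm and comparing the two minimum scores reproduces the comparison procedure described above.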

Figure 2. Comparison of the K-Means and the K-Medoids algorithm
The test results for each number of clusters using the K-Means and K-Medoids algorithms are shown in Table 2. Comparing the Davies-Bouldin index values of the two algorithms from k = 2 to k = 10, as presented in Table 2, the smallest value for K-Means is 0.064, at k = 5, and the smallest value for K-Medoids is 0.411, at k = 2. The smallest Davies-Bouldin index value of K-Means (k = 5) is thus lower than that of K-Medoids. Therefore, it can be concluded that the K-Means algorithm clusters the spread of the coronavirus better than the K-Medoids algorithm. Based on this analysis, the best clustering is used to interpret the spread of the coronavirus, and the results are given in Table 3.