Application of the DBSCAN Method to Improve K-Means Performance for Clustering Tweet Data

  • Astri Fatimah, Department of Statistics, IPB
  • Anang Kurnia, Department of Statistics, IPB
  • Septian Rahardiantoro, Department of Statistics, IPB
  • Yani Nurhadryani, Department of Computer Science, IPB
Keywords: cluster analysis; DBSCAN; K-Means; silhouette coefficient; text data


Text mining is the process of extracting information from text data by computer. Text data are unstructured and therefore difficult to analyze directly, but they can be converted into structured data through pre-processing. After pre-processing, the text is represented numerically using the vector space model with term frequency-inverse document frequency (TF-IDF) weighting so that it can be used for analysis. K-Means cluster analysis is one method that can be applied to such data, but K-Means is not robust to noise. Outliers can be detected using Density-Based Spatial Clustering of Applications with Noise (DBSCAN) cluster analysis, and the outliers it identifies can then be removed from the data. After the outliers were removed, cluster analysis was carried out again using K-Means with the same number of clusters k. The quality of the clustering results was evaluated with the Silhouette Coefficient (SC). For a small data set, the SC value of K-Means increased significantly, by 0.21, after the outliers were removed. Increasing the amount of text data in the cluster analysis also affects the number of clusters. This is influenced by the number of words in a document that are given weight: the fewer words that are given weight, the more clusters are generated.
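The workflow summarized above (cluster with K-Means, flag noise with DBSCAN, drop it, re-cluster with the same k, compare SC) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The toy 2-D points stand in for TF-IDF document vectors, and the values of `eps` and `min_pts`, the farthest-first initialization, and all function names are assumptions chosen for illustration only.

```python
import math

def dbscan_noise(points, eps, min_pts):
    """Indices of points that DBSCAN would label as noise."""
    n = len(points)
    # eps-neighbourhood of each point (the point itself is included).
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = {i for i in range(n) if len(nbrs[i]) >= min_pts}
    # Noise: neither a core point nor within eps of any core point.
    return [i for i in range(n)
            if i not in core and not any(j in core for j in nbrs[i])]

def kmeans(points, k, iters=100):
    """Lloyd's algorithm; farthest-first initialization is an assumption
    made here to keep the sketch deterministic."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1 or len(clusters) < 2:
            continue  # contributes 0 to the mean
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy stand-in for document vectors: two tight groups and one far outlier.
docs = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (20, 20)]
k = 2

sc_before = silhouette(docs, kmeans(docs, k))
noise = dbscan_noise(docs, eps=1.5, min_pts=3)
clean = [p for i, p in enumerate(docs) if i not in noise]
sc_after = silhouette(clean, kmeans(clean, k))  # same k as before

print(noise, round(sc_before, 2), round(sc_after, 2))
```

On this toy data the outlier takes one of the two clusters for itself, so the two real groups are merged. Once DBSCAN flags it and it is removed, K-Means with the same k separates the groups and the SC rises. The size of the increase here depends entirely on the made-up points and says nothing about the 0.21 reported above.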