k-means Based Document Clustering with Automatic ‘k’ Selection and Cluster Refinement


Author(s): Himanshu Gupta, Dr. Rajeev Srivastava

In recent years, use of the web has increased manifold, and in this setting efficiency is as important as accuracy. Automatic document clustering is an important part of many fields, such as data mining and information retrieval. Most document clustering techniques are based on k-means and its variants. K-means is a fast algorithm, but it has some shortcomings. The k in k-means stands for the number of clusters, which the user has to provide, but most of the time users don’t have any clue about a suitable value of k. In our implementation of the document clustering technique, we used SVD (Singular Value Decomposition) to determine the number of clusters (the value of k) required. The k-means algorithm is then used to create the clusters, and in the last phase of the algorithm the clusters are refined by feature voting. The refinement phase enables us to make our algorithm much faster than the k-means algorithm.
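As a rough illustration of the first two phases, the following minimal Python sketch estimates k from the singular values of a toy term-document matrix and then runs plain k-means with that k. It is not the authors' implementation. The abstract does not state how SVD is used to choose k, so the energy-threshold rule, the function name estimate_k_from_svd, the threshold 0.85, and the toy matrix are all assumptions made for illustration. The feature-voting refinement phase is not shown, since its details are not given here.

```python
import numpy as np


def estimate_k_from_svd(X, energy=0.85):
    """Assumed heuristic (not from the paper): choose k as the smallest
    number of singular values whose squared sum reaches `energy` of the total."""
    s = np.linalg.svd(X, compute_uv=False)  # singular values, descending
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cumulative, energy) + 1)


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd-style k-means with Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct documents chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each document to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Toy data (hypothetical): rows are documents, columns are term counts.
# Documents 0-2 share one vocabulary, documents 3-5 another.
docs = np.array([
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 1, 1, 2],
], dtype=float)

k = estimate_k_from_svd(docs)
labels, centroids = kmeans(docs, k)
print("estimated k:", k)
print("cluster labels:", labels)
```

On this toy matrix the two dominant singular values carry most of the energy, so the assumed rule selects two clusters. On real document collections the threshold would need tuning, and a different SVD-based criterion may well be what the authors use.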