Comparison of Distributed K-Means and Distributed Fuzzy C-Means Algorithms for Text Clustering

I Made Artha Agastya; Teguh Bharata Adji; Noor Akhmad Setiawan

doi:10.21924/cst.2.1.2017.46

Keywords:

K-Means, FCM, Tanimoto Distance, MapReduce, Hadoop

I Made Artha Agastya

Universitas Gadjah Mada

Teguh Bharata Adji

Universitas Gadjah Mada

Noor Akhmad Setiawan

Universitas Gadjah Mada

Abstract

Text clustering has moved to distributed systems as data volumes grow. Popular algorithms such as K-Means (KM) and Fuzzy C-Means (FCM) are combined with MapReduce in a Hadoop environment so that they can be distributed and parallelized. However, the performance of Distributed KM (DKM) and Distributed FCM (DFCM) using the Tanimoto Distance Measure (TDM) has not yet been compared. The comparison matters because TDM is scale invariant while still allowing discrimination between collinear vectors. This work compares the combination of TDM with DKM (DKM-T) and of TDM with DFCM (DFCM-T) to measure the performance of both algorithms. The results show that DFCM-T achieves better intra-cluster and inter-cluster densities than DKM-T. Moreover, DFCM-T has a lower processing time than DKM-T when 4 and 8 nodes are used. DFCM-T and DKM-T clustered 1,400,000 text files in 16.18 and 9.74 minutes, respectively, although preprocessing took hours.
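
The following Python sketch illustrates the ideas named in the abstract on a single machine. It is not the authors' Hadoop/MapReduce implementation. It assumes the common extended-Jaccard form of the Tanimoto distance, 1 - a·b / (|a|² + |b|² - a·b), and the standard FCM membership formula with a fuzzifier `m`; the function names, the toy vectors, and the choice `m = 2` are illustrative assumptions that do not come from the paper.

```python
import math


def tanimoto_distance(a, b):
    """Tanimoto (extended Jaccard) distance, assumed form:
    1 - a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0:  # both vectors are all zeros
        return 0.0
    return 1.0 - dot / denom


def cosine_distance(a, b):
    """Cosine distance, shown only for contrast with Tanimoto."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def kmeans_assign(point, centroids):
    """K-Means style hard assignment: index of the nearest centroid."""
    return min(range(len(centroids)),
               key=lambda j: tanimoto_distance(point, centroids[j]))


def fcm_memberships(point, centroids, m=2.0):
    """FCM style soft assignment: one membership degree per centroid.
    Uses u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1)); m is a hypothetical choice."""
    dists = [tanimoto_distance(point, c) for c in centroids]
    # A point that coincides with one or more centroids belongs fully to them.
    zeros = [j for j, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if j in zeros else 0.0
                for j in range(len(dists))]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** power for d_k in dists) for d_j in dists]


# Toy term-weight vectors (hypothetical data).
doc = [1.0, 2.0, 0.0]
longer_doc = [2.0, 4.0, 0.0]          # collinear with doc, twice as long
centroids = [[1.0, 2.0, 0.0], [0.0, 0.0, 5.0]]

# Collinear vectors: cosine distance is (numerically) zero, Tanimoto is not.
d_cos = cosine_distance(doc, longer_doc)
d_tan = tanimoto_distance(doc, longer_doc)

# Scaling both vectors by the same factor leaves the Tanimoto distance unchanged.
d_tan_scaled = tanimoto_distance([3 * x for x in doc],
                                 [3 * x for x in longer_doc])

hard = kmeans_assign(longer_doc, centroids)
soft = fcm_memberships(longer_doc, centroids)
print(d_cos, d_tan, d_tan_scaled, hard, soft)
```

On these toy vectors, the hard assignment picks a single cluster for the document, while the soft assignment returns membership degrees that sum to 1. In a MapReduce setting, a step of this kind would typically run per document in the map phase, with centroid updates aggregated in the reduce phase; that split is an assumption about the general pattern, not a description of the paper's code.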

How to Cite

Agastya, I. M. A., Adji, T. B., & Setiawan, N. A. (2017). Comparison of Distributed K-Means and Distributed Fuzzy C-Means Algorithms for Text Clustering. Communications in Science and Technology, 2(1). https://doi.org/10.21924/cst.2.1.2017.46

Issue

Vol. 2 No. 1 (2017)

Section

Articles

This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright

Open Access authors retain the copyrights of their papers, and all open access articles are distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution and reproduction in any medium, provided that the original work is properly cited.

The use of general descriptive names, trade names, trademarks, and so forth in this publication, even if not specifically identified, does not imply that these names are not protected by the relevant laws and regulations.

While the advice and information in this journal are believed to be true and accurate on the date of its going to press, neither the authors, the editors, nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
