A Parallel Implementation of the K-Means Algorithm Based on MapReduce

Abstract:

With the explosion of data, data mining algorithms must handle huge numbers of records. Traditionally, processing runs in a single control flow, so computing time grows quickly as the data scale increases. K-means is one of the most widely used algorithms in cluster analysis. MapReduce is a programming model that has been widely used for processing data in parallel environments. This paper presents an implementation of the K-means algorithm based on the MapReduce model, so that the clustering system can handle massive data in a fast and scalable fashion. A brief outline of the algorithm's structure and an analysis of the main improvement are also given. We demonstrate that the algorithm becomes superior as the volume of data grows or as the number of nodes in the computer cluster increases.
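The general idea of expressing K-means in MapReduce terms can be illustrated with a minimal single-process sketch. This is not the paper's implementation, which the abstract does not detail. It assumes the common formulation in which the map step assigns each point to its nearest centroid and the reduce step averages the points assigned to each centroid. The names `kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, and `kmeans` are hypothetical, and the shuffle phase is simulated with an in-memory dictionary rather than a real cluster.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centroids):
    """Map step (assumed form): emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step (assumed form): average all points sharing one centroid key."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(s / count for s in total)


def kmeans_iteration(points, centroids):
    """One simulated MapReduce job: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)
    # A centroid that attracted no points keeps its previous position.
    return [
        kmeans_reduce(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat the job until no centroid moves more than tol, or max_iter is hit."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if all(dist(a, b) <= tol for a, b in zip(centroids, updated)):
            return updated
        centroids = updated
    return centroids
```

In a real deployment each call to `kmeans_map` would run on the node holding that slice of the data, and only the per-centroid partial results would cross the network, which is why this formulation can benefit from both larger data volumes and more nodes.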

Info:

Periodical:

Advanced Materials Research (Volumes 989-994)

Pages:

1578-1581

Online since:

July 2014

Copyright:

© 2014 Trans Tech Publications Ltd. All Rights Reserved
