Research on Parallel DBSCAN Algorithm Design Based on MapReduce


Article Preview

Data clustering has been received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The enlarging volumes of information emerging by the progress of technology, makes clustering of very large scale of data a challenging task. In order to deal with the problem, more researchers try to design efficient parallel clustering algorithms. In this paper, we propose a parallel DBSCAN clustering algorithm based on Hadoop, which is a simple yet powerful parallel programming platform. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.



Advanced Materials Research (Volumes 301-303)

Edited by:

Riza Esa and Yanwen Wu




Y. X. Fu et al., "Research on Parallel DBSCAN Algorithm Design Based on MapReduce", Advanced Materials Research, Vols. 301-303, pp. 1133-1138, 2011

Online since:

July 2011




[1] J. MacQueen: Some Methods for Classification and Analysis of Multivariate Observations. In: Proc. 5th Berkeley Symp. Math. Statist, Prob., 1: 281-297(1967).

[2] R. Ng, J. Han: Efficient and Effective Clustering Method for Spatial Data Mining. In: Proc. 1994 Int. Conf. Very large Data Bases, pp.144-155. Santiago, Chile (1994).

[3] S. Guha, R. Rastogi, K. Shim: CURE: An Efficient Clustering Algorithm for Large Databases. In: Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pp.73-84. Seattle, WA (1998).


[4] G. Karypis, E. H. Han, V. Kumar: CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. Computer, 32: 68-75 (1999).


[5] M. Ester, H. Kriege, J. Sander, X. Xu: A Density-based Algorithm for Discovering Clusters in Large Spatial Databases. In: Proc. 1996 Int. Conf. Knowledge Discovery and Data Mining, pp.226-231. Portland, OR (1996).

[6] W. Wang, J. Yang, R. Muntz: STING: A Statistical Information Grid Approach to Spatial Data Mining. In: Proc. 1997 Int. Conf. Very Large Data Bases, pp.186-195. Athens, Greece (1997).

[7] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan: Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In: Proc. 1998 ACM SIGMOD Int. Conf. Management of Data, pp.94-105. Seattle, WA (1998).


[8] E. Rasmussen, P. Willett, P.: Efficiency of Hierarchical Agglomerative Clustering Using the ICL Distributed Array Processor. Journal of Documentation, 45(1): 1-24. (1989).


[9] X. Li, Z. Fang: Parallel Clustering Algorithms. Parallel Computing, 11: 275-290. (1989).


[10] C. Olson: Parallel Algorithms for Hierarchical Clustering. Parallel Computing, 21(8): 1313-1325. (1995).


[11] J. Dean, S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. In: Proc. of Operating Systems Design and Implementation, pp.137-150. San Francisco, CA (2004).

[12] Hadoop: Open source implementation of MapReduce. http: /lucene. apache. org/hadoop.

[13] S. Ghemawat, H. Gobioff, S. Leung: The Google File System. Symposium on Operating Systems Principles, pp.29-43 (2003).


[14] D. Borthakur. The Hadoop Distributed File System: Architecture and Design. (2007).

[15] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, C. Kozyrakis: Evaluating MapReduce for Multi-core and Multiprocessor Systems. In: Proc. of 13th Int. Symposium on High-Performance Computer Architecture (HPCA). Phoenix, AZ (2007).


[16] R. Lammel.: Google's MapReduce Programming Model - Revisited. Science of Computer Programming, 70: 1-30 (2008).


[17] X. Xu, J. Jager, H. Kriegel: A Fast Parallel Clustering Algorithm for Large Spatial Databases. Data Mining and Knowledge Discovery, 3: 263-290 (1999).