A New Data Classification Algorithm for Data-Intensive Computing Environments

Abstract:

Data mining techniques for data-intensive computing face two problems: how to scale data processing capability and how to improve data availability. To address them, this paper presents a new tree learning method. By introducing MapReduce, the SPRINT-based tree learning method scales well to large datasets. Moreover, we define the process of finding the split point as a series of distributed computations, each of which is implemented with the MapReduce model. A new data structure, called the class distribution table, is introduced to assist the calculation of histograms. Experiments and analysis of the results show that the algorithm has strong data mining capabilities in data-intensive computing environments.
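The abstract does not spell out the split-point computation, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes a single numeric attribute, the Gini index used by SPRINT, and an in-process simulation of the map and reduce steps. The names `map_partition`, `reduce_counts`, and `best_split` are hypothetical, and the per-value count table is a simple stand-in for the class distribution table, whose actual layout is not given here.

```python
from collections import Counter, defaultdict
from itertools import chain


def gini(counts):
    """Gini index of a class-count mapping: 1 - sum(p_c ** 2)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def map_partition(records):
    """Map step (hypothetical): emit ((value, label), 1) per record of one partition."""
    for value, label in records:
        yield (value, label), 1


def reduce_counts(pairs):
    """Reduce step (hypothetical): sum counts into value -> Counter(label -> count)."""
    table = defaultdict(Counter)
    for (value, label), n in pairs:
        table[value][label] += n
    return table


def best_split(table):
    """Scan thresholds 'attribute <= v' in sorted order.

    Returns (threshold, weighted_gini) for the lowest weighted Gini index.
    """
    total = Counter()
    for class_counts in table.values():
        total.update(class_counts)
    n = sum(total.values())

    below = Counter()  # running class histogram of records with value <= v
    best = (None, float("inf"))
    values = sorted(table)
    for v in values[:-1]:  # the largest value would leave the right side empty
        below.update(table[v])
        above = total - below
        n_below = sum(below.values())
        score = (n_below / n) * gini(below) + ((n - n_below) / n) * gini(above)
        if score < best[1]:
            best = (v, score)
    return best


# Two partitions of (attribute value, class label) records, as if held on two nodes.
partitions = [
    [(1, "a"), (2, "a")],
    [(3, "b"), (4, "b")],
]
emitted = chain.from_iterable(map_partition(p) for p in partitions)
threshold, score = best_split(reduce_counts(emitted))
print(threshold, score)
```

In this sketch only the small count table, not the raw records, reaches the split search, which is the kind of saving a distributed histogram computation aims for.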

Info:

Periodical:

Advanced Materials Research (Volumes 756-759)

Pages:

3318-3323

Online since:

September 2013

Copyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved
