Distributed Computing | Scientific.Net

The Hadoop Ecosystem: An Open-Source Framework for Enterprise-Scale Big Data Processing and Analytics

Authors: Nagwa Elmobark, Aymen Saad, Sajjad H. Hasan, Mohamed Badouch

Abstract: The exponential growth of digital information presents unprecedented challenges for conventional data processing systems. This research explores the Hadoop ecosystem as an innovative approach to Big Data management, analyzing its architecture, capabilities, and strategic importance in modern data analytics. The study investigates Hadoop's distributed computing framework, which enables parallel processing of large datasets across commodity hardware. Key components including the Hadoop Distributed File System (HDFS), the MapReduce programming model, and YARN resource management are analyzed to illustrate the platform's distinctive approach to handling large, complex data workloads. Comparative analysis reveals Hadoop's significant advantages over conventional systems, including cost-effectiveness, scalability, and fault tolerance. The research highlights the ecosystem's evolution, from its origins to contemporary cloud-based implementations, and examines integration capabilities with tools like Hive, Pig, and Spark that extend its analytical potential. While identifying challenges such as operational complexity and security concerns, the study ultimately positions Hadoop as a vital technology for organizations seeking to leverage Big Data for strategic decision-making. The findings underscore Hadoop's potential to transform data processing approaches, offering a robust, flexible solution to the growing needs of modern data-driven organizations.

210

Applied-Information Technology with Distributed Text Feature Extraction Method Based on MapReduce

Authors: Lu Chen, Tao Zhang, Yuan Yuan Ma, Cheng Zhou

Abstract: With the rapid development of Internet and information technology, large volumes of document data have emerged, and text classification techniques for handling massive amounts of data are becoming increasingly important. This paper presents a distributed text feature extraction method based on the MapReduce distributed computing model. For mass text processing, the method addresses the limit on processable text size and the problem of inadequate performance, and it offers a new approach to research on text feature extraction.

444
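
To illustrate how a text feature statistic fits the MapReduce model named in the abstract above, here is a minimal single-process sketch in Python. It computes document frequency, chosen here only as an example feature; the abstract does not say which features the authors extract, and the function names (map_doc, reduce_term, document_frequency) are hypothetical.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map: emit (term, doc_id) once per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_term(term, doc_ids):
    """Reduce: document frequency is the number of distinct documents containing the term."""
    return term, len(set(doc_ids))


def document_frequency(docs):
    """Run map, group by key (the shuffle step), then reduce, all in one process."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for term, emitted_id in map_doc(doc_id, text):
            groups[term].append(emitted_id)
    return dict(reduce_term(term, ids) for term, ids in groups.items())


# Hypothetical toy corpus, not data from the paper.
docs = {1: "big data text", 2: "text mining", 3: "Big text"}
df = document_frequency(docs)
```

In a real cluster the grouping step would be performed by the framework between the map and reduce tasks; the sketch only mirrors that data flow.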

Design and Development of Trade Detector Based on Distributed Computing Technology

Authors: Qi Wang, Wei Lai

Abstract: J2EE is a set of APIs, services, and protocols developed by Sun for implementing distributed computing. The design objective of the J2EE development tools is to provide developers with a simple, easy-to-operate platform that reduces development cost, shortens the development cycle, and improves the overall performance of computer systems. In this paper, we use the JSP container function module of J2EE to improve a computer-based economic detection system and establish a data connection between the system and HTML web pages, which realizes real-time, continuous detection of the economy. Finally, based on the external demand for and exports of industrial products in an area, we detect trade transactions in the region, compute statistics using the J2EE system, and obtain the trade scale. This provides a new computational method for research on export trade and economic detection.

550

Application of MATLAB Parallel Programming Technology

Authors: Song Chi

Abstract: The technology and process of parallel application development are analyzed based on the MATLAB parallel and distributed computing toolbox. Serial and parallel computing are compared through computer simulations, and a method for designing and developing parallel computing programs is proposed. The simulation results show that parallel computing has many advantages for computation-intensive calculations and that MATLAB makes parallel application development convenient.

3787

Distributed Collaborative Filtering Recommendation Model Based on Expand-Vector

Authors: Ye Zhu, Hong Yi Su, Cai Qun Wang, Bo Yan, Hong Zheng

Abstract: The recommendation system based on collaborative filtering is one of the most popular recommendation mechanisms. However, with the continuous expansion of the system, several problems faced by the traditional collaborative filtering recommendation algorithm (CF), such as cold start, accuracy, and scalability, have worsened. To address these issues, a distributed collaborative filtering recommendation model based on expand-vector (CF-EV) is proposed. First, the eigenvector is reasonably expanded to obtain the expand-vector, using the expand-vector model, a new extension measure created in this paper. Then, the nearest-neighbor user is found and a more accurate recommendation is given to the target user based on the calculation results. Further optimization allows the model to be applied successfully to a parallel computing framework. Using the MovieLens dataset, the performance of CF-EV is compared with that of CF in terms of both recommendation precision and speedup ratio. The experimental results show that CF-EV overcomes the cold start problem. Moreover, the accuracy and recall ratios have doubled. As the number of computing nodes increases, the distributed implementation achieves linear speedup.

2188
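
As context for the nearest-neighbor step mentioned in the abstract above, the following is a minimal Python sketch of plain user-based collaborative filtering, that is, the baseline CF and not the CF-EV model, whose expand-vector construction the abstract does not detail. The cosine similarity over co-rated items, the function names, and the toy ratings are all assumptions made for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity over items both users rated; 0.0 if there is no overlap."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def recommend(ratings, target, k=1):
    """Return items the target has not rated, scored by the k most similar users."""
    neighbours = sorted(
        (user for user in ratings if user != target),
        key=lambda user: cosine(ratings[target], ratings[user]),
        reverse=True,
    )[:k]
    scores = {}
    for user in neighbours:
        sim = cosine(ratings[target], ratings[user])
        for item, rating in ratings[user].items():
            if item not in ratings[target]:
                # Weight each neighbour's rating by its similarity to the target.
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy ratings, not MovieLens data.
ratings = {
    "ann": {"m1": 5, "m2": 1},
    "bob": {"m1": 5, "m2": 1, "m3": 4},
    "cat": {"m1": 1, "m2": 5, "m4": 5},
}
suggested = recommend(ratings, "ann", k=1)
```

A user with no ratings has zero similarity to everyone in this sketch, which is the cold start weakness the abstract refers to.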

A Parallel Implementation of the K-Means Algorithm Based on MapReduce

Authors: Wei Yan, Jing Zhou, Qi Huang, Lei Shi

Abstract: With the data explosion, data mining algorithms are required to deal with huge numbers of records. In the traditional approach, processing runs in a single control flow, so computing time grows quickly as the data scale increases. K-means is one of the most widely used algorithms in cluster analysis. MapReduce is a programming model that has been widely used for processing data in a parallel environment. This paper gives an implementation of the K-means algorithm based on the MapReduce model, so that the clustering system can handle massive data in a fast and scalable fashion. The brief structure of the algorithm and an analysis of the main improvement are also given. We demonstrate that the algorithm's advantage increases as the volume of data or the number of nodes in the computer cluster grows.

1578
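
The general way K-means maps onto MapReduce, as described in the abstract above, can be sketched in a single Python process. This is a common textbook decomposition under stated assumptions (the map step assigns each point to its nearest centroid, the reduce step averages each group), not necessarily the authors' implementation; all function names and the sample points are hypothetical.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit (centroid_index, point) for every input point."""
    for point in points:
        yield nearest(point, centroids), point


def shuffle(pairs):
    """Shuffle: group emitted points by centroid index, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, centroids):
    """Reduce: each new centroid is the mean of its group; empty clusters keep the old one."""
    updated = list(centroids)
    for idx, members in groups.items():
        dim = len(members[0])
        updated[idx] = tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
    return updated


def kmeans(points, centroids, iterations=10):
    """Repeat map/shuffle/reduce rounds until the centroids stop changing."""
    for _ in range(iterations):
        updated = reduce_phase(shuffle(map_phase(points, centroids)), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical sample data with two obvious clusters.
points = [(0, 0), (0, 2), (10, 10), (10, 12)]
result = kmeans(points, [(0, 0), (10, 10)])
```

Each iteration corresponds to one MapReduce job, so the map calls are independent and can be spread across nodes.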

Snap-Stabilizing Wave Algorithm with Multiple Initiators in Arbitrary Networks

Authors: Ganesh Nandakumaran, Mehmet Hakan Karaata

Abstract: A wave is a distributed execution, often made up of a broadcast phase followed by a feedback phase, requiring the participation of all the system processes before a particular event called decision is taken. A large number of problems, such as global snapshots, can be solved efficiently using multiple concurrent initiators. In this paper, we propose an optimal snap-stabilizing algorithm, referred to as an m-wave algorithm, that can be initiated by one or more initiator processes, essentially forming a collection of individual waves. Having multiple initiators enables better reach and faster completion of broadcast messages. Our algorithm differs from existing multi-node broadcasting techniques in a few notable ways, such as working in any arbitrary network and having dynamic initiator processes that participate in an m-wave cycle depending on the presence of an external input. Being snap-stabilizing ensures the proposed algorithm always behaves according to its specification.

619
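
The broadcast-then-feedback structure of a wave, as defined in the abstract above, can be illustrated with a sequential Python simulation. This sketch assumes a single initiator and a fault-free, connected undirected network; it is neither snap-stabilizing nor the authors' m-wave algorithm, and the function name and sample graph are hypothetical.

```python
from collections import deque


def wave(graph, initiator):
    """Simulate one wave; return (parent map, number of processes acknowledged)."""
    parent = {initiator: None}
    order = []
    queue = deque([initiator])
    # Broadcast phase: each process forwards the message to neighbours
    # that have not yet received it, which builds a spanning tree.
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in graph[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    # Feedback phase: walk the broadcast order backwards so every child
    # reports to its parent before the parent reports upward.
    acks = {node: 1 for node in order}
    for node in reversed(order):
        if parent[node] is not None:
            acks[parent[node]] += acks[node]
    return parent, acks[initiator]


# Hypothetical four-process network given as adjacency lists.
graph = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
parent, acknowledged = wave(graph, "a")
# The initiator may take the decision once every process has acknowledged.
decided = acknowledged == len(graph)
```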

Special Database System Design Using Sector Organization and Server Clustering Techniques

Authors: Qiu Dong Sun, Jian Cun Zuo, Yu Feng Shao, Lin Gui

Abstract: To overcome the slow access speed and low security level of common databases, this paper operates on sectors directly instead of using general file access, and uses distributed computing and clustering techniques to form an information server cluster as a special database system. First, layout and sector segmentation methods are provided for data access in the sector-based database. Then, management methods are given for controlling the information servers in the cluster. Finally, to schedule data storage and information query tasks more efficiently, a dynamic, self-adaptive scheduling algorithm is introduced into the application server of the cluster. Practice shows that a system developed with this design strategy has good efficiency and security, and that the access speed of the special database system is almost 25 times that of a common database.

1377

Hadoop and PaaS Collaborative Practice Research in Video Monitoring Platform

Authors: Wei Dong Zhang, Xiang Qian Ding, Rui Chun Hou

Abstract: Cloud computing is an important area of current information technology research. Particularly in the era of big data, Platform as a Service (PaaS) has become one of the hot topics the industry is exploring in cloud computing. However, the functional composition and architecture of platform services are not yet settled. This paper applies Hadoop and PaaS collaboratively in a video monitoring platform, proposes a new cloud computing solution for video monitoring platforms, and gives an overview of the functions of PaaS and the various modules under this solution. It provides a reference model for collaborative applications of Hadoop and PaaS in the era of big data.

3549

Split Process Cluster: A Distributed Computing Platform for Edge Extraction of Massive Remote Sensing Images

Authors: Fu Chao Cheng, Fang Miao, Wen Hui Yang

Abstract: In existing distributed edge extraction methods based on MapReduce, inappropriate dataset split algorithms lead to the loss of image features in the results. We present a distributed computing platform called Split Process Cluster (SPC) to resolve this problem. In SPC, the images are partitioned with the resilient image pyramid model (RIP), a multi-layer, redundant data structure we presented earlier, to preserve the integrity of the original image features. SPC packages the image data as key-value pairs, which can be processed through Hadoop, and reduces the results with the density-based spatial clustering of applications with noise (DBSCAN) algorithm. Compared with the traditional method, the extraction rate of image features is improved with SPC, which indicates that SPC is an efficient way to improve the extraction rate of distributed edge extraction.

2268

Papers by Keyword: Distributed Computing