Deep Web Data Integration with near Duplicate Free

Abstract:

Deep web data integration has been the focus of many research efforts in recent years. Although near-duplicate detection is very important for a deep web integration system, little research has focused on combining deep web integration with near-duplicate detection. In this paper, we develop an integration system, DWI-ndfree, to solve this problem. The wrapper of DWI-ndfree consists of four parts: the form filler, the navigator, the extractor, and the near-duplicate detector. To find near-duplicate records, we propose an efficient algorithm, CheckNearDuplicate. DWI-ndfree integrates deep web data free of near duplicates and has been used to execute several web extraction and integration tasks efficiently.
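The abstract does not describe how CheckNearDuplicate works, so the following is only a minimal sketch of what a near-duplicate filter over extracted records might look like. It assumes records are dictionaries of field values and uses Jaccard similarity over word tokens with a threshold of 0.8; the function names, the similarity measure, and the threshold are all illustrative assumptions, not the authors' algorithm.

```python
def tokenize(record):
    """Lowercase every field value and split it into a set of word tokens."""
    tokens = set()
    for value in record.values():
        tokens.update(str(value).lower().split())
    return tokens


def jaccard(a, b):
    """Jaccard similarity of two token sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(records, threshold=0.8):
    """Keep a record only if it is not too similar to any record kept so far.

    The threshold is a hypothetical default chosen for illustration.
    """
    kept = []
    kept_tokens = []
    for record in records:
        tokens = tokenize(record)
        if any(jaccard(tokens, seen) >= threshold for seen in kept_tokens):
            continue  # near duplicate of an earlier record, skip it
        kept.append(record)
        kept_tokens.append(tokens)
    return kept


records = [
    {"title": "Deep Web Crawl", "year": 2008},
    {"title": "deep web crawl", "year": 2008},
    {"title": "Set Similarity Joins", "year": 2004},
]
unique_records = drop_near_duplicates(records)
```

This pairwise comparison is quadratic in the number of records; a system that handles large result sets would likely need an index over the token sets to avoid comparing every pair.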

Info:

Periodical:

Advanced Materials Research (Volumes 756-759)

Pages:

1855-1859

Online since:

September 2013

Copyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved
