Deep Web Data Integration with near Duplicate Free

Meng Juan Li; Lian Yin Jia; Jin Guo You; Jia Man Ding; Hai He Zhou

doi:10.4028/www.scientific.net/AMR.756-759.1855

Abstract:

Deep web data integration has been the focus of many research efforts in recent years. Although near-duplicate detection is very important for a deep web integration system, little research has combined deep web integration with near-duplicate detection. In this paper, we develop an integration system, DWI-ndfree, to solve this problem. The wrapper of DWI-ndfree consists of four parts: the form filler, the navigator, the extractor, and the near-duplicate detector. To find near-duplicate records, we propose an efficient algorithm, CheckNearDuplicate. DWI-ndfree integrates deep web data free of near duplicates and has been used to execute several web extraction and integration tasks efficiently.
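To make the idea of a near-duplicate detector concrete, the following is a minimal sketch of filtering near-duplicate records from extracted results. It is not the CheckNearDuplicate algorithm, whose details the abstract does not give; it swaps in a plain Jaccard similarity over word tokens. The function names, the tokenization rule, and the 0.8 threshold are all assumptions made for illustration.

```python
import re


def tokens(record: str) -> frozenset[str]:
    """Lowercase a record and split it into a set of alphanumeric tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", record.lower()))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(records: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a record only if its similarity to every kept record is below threshold.

    Hypothetical helper: a quadratic pairwise scan, fine for small result sets.
    """
    kept: list[str] = []
    kept_tokens: list[frozenset[str]] = []
    for record in records:
        current = tokens(record)
        if all(jaccard(current, seen) < threshold for seen in kept_tokens):
            kept.append(record)
            kept_tokens.append(current)
    return kept


if __name__ == "__main__":
    # Made-up records standing in for rows extracted from two web sources.
    extracted = [
        "Blue Widget, 10 USD",
        "blue widget 10 usd",
        "Red Gadget, 25 USD",
    ]
    for row in drop_near_duplicates(extracted):
        print(row)
```

The pairwise scan compares each new record against every record already kept, so its cost grows quadratically with the number of records. An efficient detector would normally use an index to avoid most of those comparisons.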

Info:

Periodical:

Advanced Materials Research (Volumes 756-759)

Pages:

1855-1859

DOI:

https://doi.org/10.4028/www.scientific.net/AMR.756-759.1855

Online since:

September 2013

Authors:

Meng Juan Li, Lian Yin Jia, Jin Guo You, Jia Man Ding, Hai He Zhou

Keywords:

CheckNearDuplicate Algorithm, Deep Web Integration, DWI-ndfree, Near Duplicate Free, Wrapper
