A Projection-Based Locality-Sensitive Hashing Technique for Reducing False Negatives

Article Preview

Abstract:

It is challenging to efficiently find similar pairs of objects when the number of objects is huge. The locality-sensitive hashing techniques have been developed to address this issue. They employ the hash functions to map objects into buckets, where similar objects have high chances to fall into the same buckets. This paper is concerned with a locality-sensitive hashing technique, the projection-based method, which is applicable to the Euclidean distance-based similar pair identification problem. It proposes an extended method which allows an object to be hashed to more than one bucket by introducing additional hashing functions. From the experimental studies, it has been shown that the proposed method could provide better performance compared to the projection-based method.

You might also be interested in these eBooks

Info:

Periodical:

Pages:

1341-1346

Citation:

Online since:

December 2012

Authors:

Export:

Price:

Permissions CCC:

Permissions PLS:

Сopyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved

Share:

Citation:

* - Corresponding Author

[1] A. Rajaraman and J. D. Ullman: Mining of Massive Datasets, Cambridge University Press (2012).

Google Scholar

[2] U. Manber: Finding similar files in a large file system, Proc. USENIX Conference (1994) 1–10.

Google Scholar

[3] A. Z. Broder: On the resemblance and containment of documents, Proc. Compression and Complexity of Sequence (1997) 21–29.

Google Scholar

[4] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher: Min-wise independent permutations, ACM Symposium on Theory of Computing (1998) 327–336.

DOI: 10.1145/276698.276781

Google Scholar

[5] A. Andoni and P. Indyk: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, Comm. ACM, 51(1) (2008) 117–122.

DOI: 10.1145/1327452.1327494

Google Scholar

[6] M. S. Charika: Similarity estimation techniques from rounding algorithms, ACM Symposium on Theory of Computing (2002) 380–388.

Google Scholar

[7] K. M. Lee, K. M. Lee: Fuzzy Technique-based Identification of Close and Distant Clusters in Clustering, Int. J. of Fuzzy Logic and Intelligent Systems 11(3) (2011) 165–170.

DOI: 10.5391/ijfis.2011.11.3.165

Google Scholar

[8] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni: Locality-sensitive hashing scheme based on p-stable distribution, Symp. on Computational Geometry (2004) 253–262.

DOI: 10.1145/997817.997857

Google Scholar

[9] M. Theodbald, J. Siddhaarth, and A. Paepcke: SpotSigs - robust and efficient near duplicate detection in large web collections, 31st Annual ACM SIGIR Conference, Singapore, July (2008).

DOI: 10.1145/1390334.1390431

Google Scholar

[10] S. Chauddhuri, V. Ganti, and R. Kaushik: A primitive operattor for similarity joins in data cleaning, Proc. Int. Conf. on Data Engineering (2006).

DOI: 10.1109/icde.2006.9

Google Scholar

[11] F. Hao, J. Daugman, and P. Zielinski: A fast search algorithm for a large fuzzy database, IEEE Trans. on Information Forensics and Security 3(2) (2008).

DOI: 10.1109/tifs.2008.920726

Google Scholar

[12] M. Potthast and B. Stein: New issues in near-duplicate detection, Data Analysis, Machine Learning and Applications, Springer (2008) 601–609.

DOI: 10.1007/978-3-540-78246-9_71

Google Scholar

[13] A. Frank, A. Asuncion: UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science (2010).

Google Scholar