A Locality Sensitive Hashing Technique for Categorical Data

Article Preview

Abstract:

The measured data may contain various types of attributes such as continuous, categorical, and set-valued attributes. Several locality-sensitive hashing techniques, which enable to find similar pairs of data in a fast and approximate way, have been developed for data with either numeric or set-valued attributes. This paper introduces a new locality sensitive-hashing technique applicable to data with categorical attributes.

You might also be interested in these eBooks

Info:

Periodical:

Pages:

3159-3164

Citation:

Online since:

December 2012

Export:

Price:

Permissions CCC:

Permissions PLS:

Сopyright:

© 2013 Trans Tech Publications Ltd. All Rights Reserved

Share:

Citation:

[1] A. Rajaraman and J. D. Ullman: Mining of Massive Datasets, Cambridge University Press (2012).

Google Scholar

[2] S. Boriah, V. Chandola, V. Kumar: Similarity Measures for Categorical Data: A Comparative Evaluation, Proc. of the 8th SIAM Int. Conf. on Data Mining (2008) 243-254.

DOI: 10.1137/1.9781611972788.22

Google Scholar

[3] U. Manber: Finding similar files in a large file system, Proc. USENIX Conference (1994) 1–10.

Google Scholar

[4] A. Z. Broder: On the resemblance and containment of documents, Proc. Compression and Complexity of Sequence (1997) 21–29.

Google Scholar

[5] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher: Min-wise independent permutations, ACM Symposium on Theory of Computing (1998) 327–336.

DOI: 10.1145/276698.276781

Google Scholar

[6] A. Andoni and P. Indyk: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, Comm. ACM, 51(1) (2008) 117–122.

DOI: 10.1145/1327452.1327494

Google Scholar

[7] M. S. Charika: Similarity estimation techniques from rounding algorithms, ACM Symposium on Theory of Computing (2002) 380–388.

Google Scholar

[8] K. M. Lee, K. M. Lee: Fuzzy Technique-based Identification of Close and Distant Clusters in Clustering, Int. J. of Fuzzy Logic and Intelligent Systems 11(3) (2011) 165–170.

DOI: 10.5391/ijfis.2011.11.3.165

Google Scholar

[9] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni: Locality-sensitive hashing scheme based on p-stable distribution, Symp. on Computational Geometry (2004) 253–262.

DOI: 10.1145/997817.997857

Google Scholar

[10] M. Theodbald, J. Siddhaarth, and A. Paepcke: SpotSigs - robust and efficient near duplicate detection in large web collections, 31st Annual ACM SIGIR Conference, Singapore, July (2008).

DOI: 10.1145/1390334.1390431

Google Scholar

[11] S. Chauddhuri, V. Ganti, and R. Kaushik: A primitive operattor for similarity joins in data cleaning, Proc. Int. Conf. on Data Engineering (2006).

DOI: 10.1109/icde.2006.9

Google Scholar

[12] F. Hao, J. Daugman, and P. Zielinski: A fast search algorithm for a large fuzzy database, IEEE Trans. on Information Forensics and Security 3(2) (2008).

DOI: 10.1109/tifs.2008.920726

Google Scholar

[13] M. Potthast and B. Stein: New issues in near-duplicate detection, Data Analysis, Machine Learning and Applications, Springer (2008) 601–609.

DOI: 10.1007/978-3-540-78246-9_71

Google Scholar

[14] A. Frank, A. Asuncion: UCI Machine Learning Repository [http: /archive. ics. uci. edu/ml]. Irvine, CA: University of California, School of Information and Computer Science (2010).

Google Scholar