A Locality Sensitive Hashing Technique for Categorical Data

Kyung Mi Lee; Keon Myung Lee

doi:10.4028/www.scientific.net/AMM.241-244.3159

Abstract:

Measured data may contain various types of attributes, such as continuous, categorical, and set-valued attributes. Several locality-sensitive hashing techniques, which find similar pairs of data quickly and approximately, have been developed for data with either numeric or set-valued attributes. This paper introduces a new locality-sensitive hashing technique applicable to data with categorical attributes.
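
The abstract does not describe the proposed hash family, so the following is only a minimal sketch of the general idea of locality-sensitive hashing on categorical records, not the authors' method. It assumes attribute sampling (the bit-sampling scheme commonly used for Hamming distance): each hash table keys a record by its values on a random subset of attributes, so two records share a bucket only when they agree on every sampled attribute, and records that agree on more attributes are more likely to collide. The names `overlap_similarity` and `candidate_pairs`, and the parameters `num_tables` and `attrs_per_key`, are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def overlap_similarity(a, b):
    """Fraction of attributes on which two equal-length records agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def candidate_pairs(records, num_tables=10, attrs_per_key=2, seed=0):
    """Return index pairs (i, j), i < j, that share a bucket in some table.

    Illustrative attribute-sampling LSH (an assumption, not the paper's
    technique): each table hashes a record by its values on a random
    subset of attributes.
    """
    rng = random.Random(seed)
    num_attrs = len(records[0])
    pairs = set()
    for _ in range(num_tables):
        # Pick the attribute positions that form this table's hash key.
        attrs = rng.sample(range(num_attrs), attrs_per_key)
        buckets = defaultdict(list)
        for idx, rec in enumerate(records):
            buckets[tuple(rec[a] for a in attrs)].append(idx)
        # Every pair of records in the same bucket becomes a candidate.
        for members in buckets.values():
            pairs.update(combinations(members, 2))
    return pairs


records = [
    ("red", "small", "round", "matte"),
    ("red", "small", "round", "matte"),
    ("blue", "large", "square", "glossy"),
]
pairs = candidate_pairs(records)
```

In this toy input, records 0 and 1 are identical and therefore collide in every table, while record 2 disagrees with them on every attribute and can never share a bucket with either. Candidate pairs would normally be verified afterwards with an exact measure such as `overlap_similarity`.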

Info:

Periodical: Applied Mechanics and Materials (Volumes 241-244)

Pages: 3159-3164

DOI: https://doi.org/10.4028/www.scientific.net/AMM.241-244.3159

Online since: December 2012

Authors: Kyung Mi Lee, Keon Myung Lee

Keywords: Categorical Data, Data Analysis, Locality Sensitive Hashing, Similar Pair Identification
