Suffix Tree Constructing Algorithm for Datasets with Discrete Contents

Article Preview

Abstract:

The suffix tree is a useful data structure constructed for indexing strings. However, when it comes to large datasets of discrete contents, most existing algorithms become very inefficient. Discrete datasets are need to be indexed in many fields like record analysis, data analyze in sensor network, association analysis etc. This paper presents an algorithm, STD, which stands for Suffix Tree for Discrete contents, that performs very efficiently with discrete input datasets. It imports several wonderful intermediate data structures for discrete strings; we also take care of the situation that the discrete input strings have similar characteristics. Moreover, STD keeps the advantages of existing implementations which are for successive input strings. Experiments were taken to evaluate the performance and shown that the method works well.

You might also be interested in these eBooks

Info:

Periodical:

Pages:

867-870

Citation:

Online since:

February 2015

Export:

Price:

Permissions CCC:

Permissions PLS:

Сopyright:

© 2015 Trans Tech Publications Ltd. All Rights Reserved

Share:

Citation:

[1] McCreight E M: A space-economical suffix tree construction algorithm [J]. Journal of the ACM (JACM), Vol. 23 (1976) No. 2, p.262.

DOI: 10.1145/321941.321946

Google Scholar

[2] Charras C, Lecroq T: Handbook of exact string matching algorithms [M] (King's College Publications, 2004).

Google Scholar

[3] Mansour E, Allam A, Skiadopoulos S, et al: ERA: efficient serial and parallel suffix tree construction for very long strings [J]. Proceedings of the VLDB Endowment, Vol. 5 (2011) No. 1, p.49.

DOI: 10.14778/2047485.2047490

Google Scholar

[4] Ukkonen E: On-line construction of suffix trees [J]. Algorithmica, Vol. 14 (1995) No. 3, p.249.

DOI: 10.1007/bf01206331

Google Scholar

[5] Hunt E, Atkinson M P, Irving R W: Database indexing for large DNA and protein sequence collections [J]. The VLDB Journal, Vol. 11 (2002) No. 3, p.256.

DOI: 10.1007/s007780200064

Google Scholar

[6] Tata S, Hankins R A, Patel J M: Practical suffix tree construction[C]. Proceedings of the Thirtieth international conference on Very large data bases (Toronto, Canada, August 31-September 3, 2004), Vol. 30 (2004) p.36.

DOI: 10.1016/b978-012088469-8.50007-3

Google Scholar

[7] Tian Y, Tata S, Hankins R A, et al: Practical methods for constructing suffix trees [J]. The VLDB Journal, Vol. 14 (2005), No. 3, p.281.

DOI: 10.1007/s00778-005-0154-8

Google Scholar

[8] Phoophakdee B, Zaki M J: Genome-scale disk-based suffix tree indexing[C]. Proceedings of the 2007 ACM SIGMOD international conference on Management of data. ACM, (2007) p.833.

DOI: 10.1145/1247480.1247572

Google Scholar

[9] Phoophakdee B, Zaki M J: TRELLIS+: an effective approach for indexing genome-scale sequences using suffix trees[C]. Pacific Symposium on Biocomputing, Vol. 13 (2008), p.90.

DOI: 10.1142/9789812776136_0011

Google Scholar

[10] Ghoting A, Makarychev K: Serial and parallel methods for I/O efficient suffix tree construction[C]. Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, (2009) p.827.

DOI: 10.1145/1559845.1559931

Google Scholar

[11] Barsky M, Stege U, Thomo A, et al: Suffix trees for very large genomic sequences[C]. Proceedings of the 18th ACM conference on Information and knowledge management. ACM, (2009), p.1417.

DOI: 10.1145/1645953.1646134

Google Scholar

[12] Weiner P. Linear pattern matching algorithms[C], Switching and Automata Theory, 1973. SWAT'08. IEEE Conference Record of 14th Annual Symposium on. IEEE, (1973) p.1.

DOI: 10.1109/swat.1973.13

Google Scholar

[13] Schürmann K B, Stoye J: Suffix tree construction and storage with limited main memory [J]. (2003).

Google Scholar

[14] Japp R: The top-compressed suffix tree[C]. 21st Annual British National Conference on Databases. (2004).

Google Scholar