A Novel Weighted Dynamic Time Warping for Light Weight Speaker-Dependent Speech Recognition in Noisy and Bad Recording Conditions

Abstract:

Lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a promising solution to two problems: the risk of disclosing personal information and the difficulty of obtaining training material for many seldom-used English words and (often non-English) names. The dynamic time warping (DTW) algorithm is the state of the art for small-footprint SD ASR applications, which have limited storage space and a small vocabulary. In our previous work, we developed two fast and accurate DTW variants for clean speech data. However, speech recognition in adverse conditions remains a major challenge. To improve recognition accuracy in noisy and poor recording conditions, such as a recording volume that is too high or too low, we introduce a novel weighted DTW method. This method defines a feature index for each time frame of the training data and then applies it within the core DTW process to tune the final alignment score. In extensive experiments on one representative SD dataset of recordings from three speakers, our method achieves better accuracy than DTW, with a 0.5% relative reduction of error rate (RRER) on clean speech data and a 7.5% RRER on noisy and poorly recorded speech data. To the best of our knowledge, our weighted DTW is the first weighted DTW method designed specifically for speech data in noisy and poor recording conditions.
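The abstract does not say how the per-frame feature index is computed or where exactly it enters the DTW recursion, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that each training (template) frame carries one non-negative weight and that this weight scales the local frame distance before accumulation. The names `weighted_dtw`, `weights`, and `relative_error_reduction` are hypothetical, and the RRER helper assumes the usual definition, (baseline error - new error) / baseline error.

```python
import math


def weighted_dtw(template, query, weights):
    """Weighted DTW alignment cost between two feature sequences.

    template, query: lists of feature vectors (one per time frame).
    weights: one non-negative weight per template frame. This is a
             hypothetical stand-in for the per-frame feature index;
             its real definition is not given in the abstract.
    """
    n, m = len(template), len(query)
    if len(weights) != n:
        raise ValueError("need exactly one weight per template frame")

    # cost[i][j]: best cumulative cost aligning template[:i] with query[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumption: the weight scales the Euclidean frame distance.
            local = weights[i - 1] * math.dist(template[i - 1], query[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # template frame advances alone
                cost[i][j - 1],      # query frame advances alone
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


def relative_error_reduction(baseline_error, new_error):
    """Relative reduction of error rate, assuming the usual definition."""
    return (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Toy one-dimensional frames, purely illustrative.
    template = [[0.0], [2.0]]
    query = [[0.0], [1.0]]
    print(weighted_dtw(template, query, [1.0, 1.0]))  # unweighted baseline
    print(weighted_dtw(template, query, [1.0, 0.5]))  # second frame down-weighted
```

Setting every weight to 1.0 reduces this sketch to plain DTW, which makes it easy to compare a weighted and an unweighted score on the same pair of sequences.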

Info:

Pages: 1347-1355

Online since: January 2014

Copyright: © 2014 Trans Tech Publications Ltd. All Rights Reserved
