Syllable Segmentation of Farsi Continuous Speech Using Wavelet Coefficient Thresholding and Fuzzy Smoothing of the Energy Function
Subject areas: Renewable energies
1 - M.Sc., Jooyeshgar Rizgostar Company (شرکت جویشگر ریزگستر), Isfahan Science and Technology Town
2 - Faculty of Electrical Engineering, Najafabad Branch, Islamic Azad University
Keywords: vowel, consonant, wavelet transform, syllable segmentation, wavelet coefficient thresholding, fuzzy filter, short-time energy
Abstract:
The syllable, as a sub-word unit, is attracting growing attention in speech processing and recognition research because of its strong link to human speech production and perception. Automatic detection of syllable boundaries is an important step in research on speech prosody, natural speech synthesis, and speech recognition. This paper proposes a new method for automatically detecting syllable boundaries in continuous Farsi speech, relying on acoustic information. The authors' previous studies showed that fuzzy smoothing of the energy contour performs well compared with other methods used in this area. Here, a procedure similar to conventional speech denoising by wavelet coefficient thresholding is proposed to reduce the boundary insertion error. This procedure sharply reduces the energy of high-energy unvoiced consonants, which are responsible for extra peaks in the short-term energy contour. Experimental results show that using this method together with fuzzy smoothing of the energy contour reduces the insertion error by about 8%, with no appreciable effect on the other performance criteria. With the proposed method, more than 94% of syllables are segmented with an error of less than 50 ms.
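The two signal-processing steps named in the abstract, thresholding of wavelet coefficients and computation of a short-term energy contour, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not state the wavelet family, the decomposition depth, the threshold rule, or the frame sizes, so the sketch uses a single-level Haar transform, plain soft thresholding of the detail coefficients, and caller-supplied frame parameters. The fuzzy smoothing stage is not described in the abstract and is not reproduced.

```python
import numpy as np


def haar_dwt(x):
    """One level of the orthonormal Haar transform (assumed wavelet; even-length input)."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed even and odd samples."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x


def soft_threshold(coeffs, t):
    """Soft thresholding: shrink every coefficient toward zero by t, zeroing small ones."""
    coeffs = np.asarray(coeffs, dtype=float)
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)


def threshold_details(x, t):
    """Hypothetical helper: soft-threshold the Haar detail band and reconstruct."""
    approx, detail = haar_dwt(x)
    return haar_idwt(approx, soft_threshold(detail, t))


def short_time_energy(x, frame_len, hop):
    """Sum of squared samples in each frame of length frame_len, advanced by hop."""
    x = np.asarray(x, dtype=float)
    n_frames = max(0, 1 + (len(x) - frame_len) // hop)
    return np.array(
        [np.sum(x[i * hop : i * hop + frame_len] ** 2) for i in range(n_frames)]
    )
```

A call such as `short_time_energy(threshold_details(x, 0.1), 400, 160)` would give an energy contour of the thresholded signal; the threshold and frame values in that call are placeholders, not settings taken from the paper. Rapidly fluctuating, noise-like segments put most of their energy in the detail band, so thresholding that band lowers their contribution to the contour more than it lowers that of slowly varying segments.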