Recognizing the Emotional State Changes in Human Utterance by a Learning Statistical Method based on Gaussian Mixture Model
Subject Areas : Image, Speech and Signal ProcessingReza Ashrafidoost 1 * , Saeed Setayeshi 2 , Arash Sharifi 3
1 - Department of Computer Science, Science and Research Branch, Islamic Azad University, Tehran, Iran
2 - Amirkabir University of Technology, Tehran, Iran
3 - Department of Computer Science, Science and Research Branch, Islamic Azad University, Tehran, Iran
Keywords: speech processing, Gaussian mixture model, mel frequency cepstral coefficient, emotional states, Pattern Recognition,
Abstract :
Speech is one of the most opulent and instant methods to express emotional characteristics of human beings, which conveys the cognitive and semantic concepts among humans. In this study, a statistical-based method for emotional recognition of speech signals is proposed, and a learning approach is introduced, which is based on the statistical model to classify internal feelings of the utterance. This approach analyzes and tracks the emotional state changes trend of speaker during the speech. The proposed method classifies utterance emotions in six standard classes including, boredom, fear, anger, neutral, disgust and sadness. For this purpose, it is applied the renowned speech corpus database, EmoDB, for training phase of the proposed approach. In this process, once the pre-processing tasks are done, the meaningful speech patterns and attributes are extracted by MFCC method, and meticulously selected by SFS method. Then, a statistical classification approach is called and altered to employ as a part of the method. This approach is entitled as the LGMM, which is used to categorize obtained features. Aftermath, with the help of the classification results, it is illustrated the emotional states changes trend to reveal speaker feelings. The proposed model also has been compared with some recent models of emotional speech classification, in which have been used similar methods and materials. Experimental results show an admissible overall recognition rate and stability in classifying the uttered speech in six emotional states, and also the proposed algorithm outperforms the other similar models in classification accuracy rates.
[1] Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., and Taylor, J. G., “Emotion recognition in human-computer interaction”, IEEE Signal Processing magazine, vol. 18, no. 1, pp. 32-80, January 2001.
[2] S. Wu, T. H. Falk, W. Chan, “Automatic speech emotion recognition using modulation spectral features”, Journal of Speech Communication, May, 2011, vol. 53, pp. 768–785.
[3] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, “Speaker Diarization: A Review of Recent Research”, IEEE Transactions on Audio, Speech, and Language Processing. DOI: 10.1109/TASL.2011.2125954.
[4] C.N. Van der Wal, W. Kowalczyk, “Detecting Changing Emotions in Human Speech by Machine and Humans”, Springer Science and Business Media, NY - Applied Intelligence, December 2013. DOI: 10.1007/s10489-013-0449-1.
[5] B. Fergani, M. Davy, and A. Houacine, “Speaker diarization using one-class support vector machines,” Speech Communication, vol. 50, pp. 355-365, 2008. DOI: 10.1016/j.specom.2007.11.006.
[6] F. Valente, “Variational Bayesian Methods for Audio Indexing,” PhD. dissertation, Universite de Nice-Sophia Antipolis, 2005. DOI: 10.1007/11677482_27.
[7] P. Kenny, D. Reynolds, and F. Castaldo, “Diarization of telephone conversations using factor analysis,” Selected Topics in Signal Processing, IEEE Journal of, vol. 4, pp. 1059-1070, 2010.
[8] S. Davis, P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Audio Speech Language Processing. 28, 357–366, 1980.
[9] Wojtek Kowalczyk, C. Natalie van der Wal., “Detecting Changing Emotions in Natural Speech”, Springer Science Business Media New York, Appl Intell (2013) 39:675–691 DOI:10.1007/s10489-013-0449-1.
[10] Sara Motamed, Saeed Setayeshi, “Speech Emotion Recognition Based on Learning Automata in Fuzzy Petri-net”, Journal of mathematics and computer science, vol. 12, August 2014.
[11] Rahul B. Lanjewar, Swarup Mathurkar, Nilesh Patel, “Implementation and Comparison of Speech Emotion Recognition System using Gaussian Mixture Model (GMM) and K-Nearest Neighbor (K-NN) techniques,” In Procedia Computer Science 49 (2015) pp. 50-57, DOI: 10.1016@j.procs.2015.04.226, 2015.
[12] J. H. Wolfe, “Pattern clustering by multivariate analysis,” Multivariate Behavioral Research, vol. 5, pp. 329-359, 1970.
[13] D. Ververidis, C. Kotropoulos, “Emotional Speech Classification Using Gaussian Mixture Models and the Sequential Floating Forward Selection Algorithm,” IEEE International Conference on Multimedia and Expo, Amsterdam, 2005. DOI:10.1109/ICME.2005.1521717.
[14] H. Farsaie Alaie, L. Abou-Abbas, C. Tadj, “Cry-based infant pathology classification using GMMs,” Speech Communication (2015), DOI: 10.1016/j.specom.2015.12.001, 2015.
[15] Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., Weiss, B., “A database of german emotional speech”, INTERSPEECH, pp.1517–1520, 2005.
[16] Yang, M.Lugger, “Emotion recognition from speech signals using new harmony features,” Special Section on Statistical Signal & Array Processing, vol. 90, Issue 5, May 2010, pp. 1415–1423, DOI: 10.1016/j.sigpro.2009.09.009.
[17] A. Rabiee, S. Setayeshi, “Robust and optimum features for Persian accent classification using artificial neural network,” in the proceedings of the 19th international conference on Neural Information Processing - Volume Part IV. DOI: 10.1007/978-3-642-34478-7_54.
[18] J. Kittler, “Feature set search algorithms,” Journal of Pattern Recognition and Signal Process, 1978, pp. 41–60.
[19] R. Ashrafidoost, S. Setayeshi, “A Method for Modelling and Simulation the Changes Trend of Emotions in Human Speech”, In Proc. of 9th European Congress on Modelling and Simulation (Eurosim), Sep.2016, p.444-450, DOI:10.1109/EUROSIM.2016.30.
[20] L. R. Welch, “Hidden Markov models and the Baum-Welch algorithm,” IEEE Information Theory Society Newsletter vol. 53, pp. 1, 10-13, Dec 2003.
[21] B. Schuller, G. Rigoll, and M. Lang, “Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine - belief network architecture,” in Proc. 2004 IEEE Int. Conf. Acoustics, Audio and Signal Processing, May 2004, vol. 1, pp. 577-580.
[22] C. M. Lee, S. Narayanan, “Towards detecting emotion in spoken dialogs,” IEEE Trans. Speech and Audio Processing, vol. 13, no. 2, pp. 293-303, 2005. DOI: 10.1016/j.specom.2010.08.013
[23] A. B. Ingale, D. S. Chaudhari, “Speech Emotion Recognition,” International Journal of Soft Computing and Engineering (IJSCE), Volume-2, Issue-1, March 2012.
[24] A. S. Utane, S. L. Nalbalwar, “Emotion Recognition through Speech Using Gaussian Mixture Model and Support Vector Machine,” International Journal of Scientific & Engineering Research, Volume 4, Issue 5, May-2013.
[25] R. B. Lanjewar, S. Mathurkar, N. Patel, “Implementation and Comparison of Speech Emotion Recognition System using Gaussian Mixture Model (GMM) and K-Nearest Neighbour (NKK) Techniques,” Elsevier, Procedia Computer Science, December 2015, DOI: 10.1016/j.procs.2015.04.226.