
Fig. 6: Comparison chart of recognition rates obtained for each emotion

These assessments are provided for each of the six emotional states in the domain of emotional classes, according to Eq. 14. The results are obtained from the performance of the proposed method on the Berlin emotional speech corpus (EmoDB), and the acquired recognition accuracy rates are presented in Table 1.

Consequently, the analytical and statistical parameters that stem from the obtained results were calculated. The standard deviation of the accuracy rates across the six emotional states is 2.52, which shows that the proposed approach offers a high degree of stability in speech emotion recognition. The variance, dispersion coefficient, variation range, and geometric mean are 6.36, 0.0298, 6.98, and 84.74, respectively. The small dispersion coefficient likewise underlines the stability of the proposed system.
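As a quick arithmetic check, all of these dispersion statistics follow directly from the six LGMM recognition rates listed in Table 2. The short NumPy sketch below (our illustration, not code from the study itself) reproduces them, assuming the population definitions of variance and standard deviation:

```python
# Reproducing the quoted dispersion statistics from the six per-emotion
# recognition rates of the proposed LGMM (Table 2, LGMM row).
import numpy as np

rates = np.array([82.94, 86.78, 83.81, 80.54, 86.87, 87.52])  # Angry..Sadness

std_dev = rates.std()                            # population std -> ~2.52
variance = rates.var()                           # population var -> ~6.36
dispersion = std_dev / rates.mean()              # coeff. of variation -> ~0.0298
variation_range = rates.max() - rates.min()      # -> 6.98
geometric_mean = np.exp(np.log(rates).mean())    # -> ~84.7

print(std_dev, variance, dispersion, variation_range, geometric_mean)
```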

The comparison of the recognition rates is clearly depicted in Fig. 6. The diagram illustrates that the highest recognition rates are achieved for the emotions of anger and boredom, while the lowest belongs to the emotion of fear. This result is expected and probable because anger and boredom have distinctive acoustic attributes, whereas the attributes extracted from the acoustic speech signal for fear are similar to those of other emotions and conform to much the same pattern. This similarity in patterns can be readily deduced from the comparison diagram.

 

2.       Evaluation and Comparison

Various emotional speech recognition systems have been proposed based on well-known classifiers such as the Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), K-nearest neighbors (KNN), Support Vector Machine (SVM), and Artificial Neural Network (ANN), as well as combinations of these methods. Each of them has advantages and restrictions. In this paper, the GMM classifier is applied as the basis of the proposed model, which achieves better reliability and stability than other similar speech emotion recognition (SER) systems.
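For readers unfamiliar with GMM-based SER, the following minimal sketch shows the standard scheme that such classifiers build on: one mixture model is fitted per emotion class, and a test utterance is assigned to the class whose model yields the highest likelihood. It illustrates only the generic baseline, not the learning update of the proposed LGMM; the component count and covariance type are assumptions:

```python
# Generic GMM-based emotion classification (scikit-learn), not the LGMM itself.
# One GMM per emotion class; a test utterance is assigned to the class whose
# model gives the highest average log-likelihood over its feature frames.
from sklearn.mixture import GaussianMixture

EMOTIONS = ["angry", "boredom", "disgust", "fear", "neutral", "sadness"]

def train_gmms(features_by_emotion, n_components=8):
    """features_by_emotion: dict mapping emotion -> (n_frames, n_dims) array."""
    models = {}
    for emo in EMOTIONS:
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        gmm.fit(features_by_emotion[emo])
        models[emo] = gmm
    return models

def classify(models, utterance_frames):
    """Return the emotion whose GMM best explains the utterance frames."""
    scores = {emo: gmm.score(utterance_frames)  # mean log-likelihood per frame
              for emo, gmm in models.items()}
    return max(scores, key=scores.get)
```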

One of the main concerns in earlier research on emotional speech recognition systems was the large difference between accuracy rates across the recognized emotions. According to the test results, however, the proposed system shows stable outcomes, which obviates this chronic problem in emotional speech recognition research.

 

 

Table 2: Recognition rates (%) comparison

Classifier   Angry   Boredom   Disgust   Fear    Neutral   Sadness
LGMM         82.94   86.78     83.81     80.54   86.87     87.52
GMM [25]     92      -         -         50      73        89

In these results, the accuracy rates are not far from each other, and the recognition rates for all six emotions are close together. This outcome indicates that the proposed system yields reliable results when recognizing the different emotions contained in a single speech signal.

Several similar articles that use a GMM-based classifier and EmoDB to recognize emotions from speech signals have been reviewed. First, in 2012, Ashish B. Ingale et al. proposed a method using a mixture of classification approaches, compared the results obtained by applying them to the same speech corpus (EmoDB), and obtained acceptable results [23]. In that research, the GMM method was compared with ANN-based, HMM-based, and SVM-based methods. According to the article, the best performance belongs to the GMM-based model, although in speaker-independent recognition (similar to our method) the minimum recognition accuracy rate attained for the best features was 78.77%. In another study, in 2013, the authors proposed a GMM-SVM based approach to recognize emotional speech signals; according to their results, they achieved 78.27% recognition accuracy for the neutral emotion in the best case [24]. A review of these models shows that our proposed approach outperforms them with regard to average accuracy rates.

It is worth noting that the most important contribution of our work is a learning-based GMM method whose stability, as reflected in the accuracy rates, is the strong point of the approach. As mentioned, we achieved better average accuracy rates compared with the analogous research discussed above.

On the other hand, to emphasize our achievements in recognition stability, we have reviewed a wide range of recent similar articles in the field of emotional speech recognition that likewise apply GMM and EmoDB. In 2015, R. Lanjewar et al. published an article in which they recognized emotional states in speech using GMM and K-NN, also using EmoDB for the training and test phases, as in this paper. According to their results, they achieved good accuracy rates for some emotional states, such as anger and sadness, at 92% and 89%, respectively [25]. These results are better than ours for those two emotions, but for the other four comparable emotional states our proposed approach shows better accuracy rates. The point is that the outcomes attained for all emotional states in this paper are free of large gaps in the recognition rates, whereas in the mentioned research there are wide gaps, about 67%, between recognition rates (Table 2). This advantage of our proposed approach emphasizes the stability of the method across all emotional states and indicates the reliability of the proposed system.

 

VI.       Conclusions

We have demonstrated an approach to speech emotion recognition (SER) using an innovative classification method, based on a probabilistic model, to capture the changing trend of the speaker's emotional states. To this end, a modified version of GMM, entitled the Learning Gaussian Mixture Model (LGMM), has been applied as the basis of this emotion classification approach. The 12-MFCC method has been used to extract speech features, and the SFS method to select features efficiently from the raw audio signal of speech. Besides, the Berlin emotional speech corpus (EmoDB) has been applied for training and testing the proposed emotion recognition method. The main motivation of this research is to recognize the trend of changes in the speaker's emotions during speech. A prominent advantage of using this method is that it depicts a clear and informative view of the speaker's emotional behavior, with a high degree of accuracy, regardless of the speech context, instantaneous deeds, or contrived behaviors during the speech. The approach benefits from using MFCC in feature extraction and from its combination with SFS, which leads to more accurate results. This method of feature extraction also demonstrates acceptable performance in noisy environments, although the recognition accuracy may decrease somewhat in very noisy situations. Compared with conventional methods in the field of emotional speech recognition, and despite the limited number of training and test samples in the database, the proposed approach achieves admissible results in both recognition accuracy and run time.
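To make the front end concrete, the sketch below outlines a 12-MFCC extraction step of the kind described above, assuming the librosa library; the 25 ms / 10 ms framing is an illustrative assumption, not necessarily the paper's exact configuration. Each utterance yields a matrix of frame-level features that a selection step such as SFS can subsequently prune:

```python
# Illustrative 12-MFCC extraction per utterance (librosa assumed).
import librosa

def extract_mfcc(path, sr=16000, n_mfcc=12):
    y, sr = librosa.load(path, sr=sr)            # mono waveform, resampled
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)  # 25 ms / 10 ms
    return mfcc.T                                # shape: (n_frames, n_mfcc)
```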

This work could be extended in future experiments to further improve the classification accuracy or reduce the computational complexity. Despite the stability of our approach, it could be enhanced to achieve higher average accuracy rates with the same admissible stability. In order to improve the performance of this emotional speech recognition system, the following potential extensions are proposed. Speech feature classification techniques such as the Hidden Markov Model (HMM), Support Vector Machine (SVM), and Probabilistic Neural Network (PNN), proposed in recent papers, could be used to improve the classification part of our approach, as in the sketch below.
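As one concrete illustration of such an extension, the GMM scoring stage could be replaced by an SVM operating on fixed-length per-utterance statistics. The hypothetical sketch below shows the general shape of this substitution; the kernel and regularization values are assumptions, not tuned results:

```python
# Hypothetical SVM extension: classify per-utterance MFCC statistics
# instead of scoring frame sequences with per-class GMMs.
import numpy as np
from sklearn.svm import SVC

def utterance_vector(mfcc_frames):
    """Collapse a (n_frames, 12) MFCC matrix into one fixed-length vector."""
    return np.concatenate([mfcc_frames.mean(axis=0), mfcc_frames.std(axis=0)])

def train_svm(X, y):
    """X: (n_utterances, 24) utterance vectors; y: emotion labels."""
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")  # assumed hyperparameters
    clf.fit(X, y)
    return clf
```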

In this way, the performance of the recognition system could be further improved. Further research on the cognitive characteristics of speech signals in emotional and conceptual classification methods could also be expedient. As a final point, only a few studies have deliberated on applying multiple-classifier systems to speech emotion recognition; we believe this research direction has to be explored further.

 

References

[1]       Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., and Taylor, J. G., “Emotion recognition in human-computer interaction”, IEEE Signal Processing magazine, vol. 18, no. 1, pp. 32-80, January 2001.

[2]       S. Wu, T. H. Falk, W. Chan, “Automatic speech emotion recognition using modulation spectral features”, Speech Communication, vol. 53, pp. 768–785, May 2011.

[3]       X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, “Speaker Diarization: A Review of Recent Research”, IEEE Transactions on Audio, Speech, and Language Processing. DOI: 10.1109/TASL.2011.2125954.

[4]       C.N. Van der Wal, W. Kowalczyk, “Detecting Changing Emotions in Human Speech by Machine and Humans”, Springer Science and Business Media, NY - Applied Intelligence, December 2013. DOI: 10.1007/s10489-013-0449-1.

[5]       B. Fergani, M. Davy, and A. Houacine, “Speaker diarization using one-class support vector machines,” Speech Communication, vol. 50, pp. 355-365, 2008. DOI: 10.1016/j.specom.2007.11.006.

[6]       F. Valente, “Variational Bayesian Methods for Audio Indexing,” PhD. dissertation, Universite de Nice-Sophia Antipolis, 2005. DOI: 10.1007/11677482_27.

[7]       P. Kenny, D. Reynolds, and F. Castaldo, “Diarization of telephone conversations using factor analysis,” IEEE Journal of Selected Topics in Signal Processing, vol. 4, pp. 1059-1070, 2010.

[8]       S. Davis, P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.

[9]       W. Kowalczyk, C. N. van der Wal, “Detecting Changing Emotions in Natural Speech”, Applied Intelligence, vol. 39, pp. 675–691, 2013. DOI: 10.1007/s10489-013-0449-1.

[10]       Sara Motamed, Saeed Setayeshi, “Speech Emotion Recognition Based on Learning Automata in Fuzzy Petri-net”, Journal of Mathematics and Computer Science, vol. 12, August 2014.

[11]       R. B. Lanjewar, S. Mathurkar, N. Patel, “Implementation and Comparison of Speech Emotion Recognition System using Gaussian Mixture Model (GMM) and K-Nearest Neighbor (K-NN) techniques,” Procedia Computer Science, vol. 49, pp. 50–57, 2015. DOI: 10.1016/j.procs.2015.04.226.

[12]       J. H. Wolfe, “Pattern clustering by multivariate analysis,” Multivariate Behavioral Research, vol. 5, pp. 329-359, 1970.

[13]       D. Ververidis, C. Kotropoulos, “Emotional Speech Classification Using Gaussian Mixture Models and the Sequential Floating Forward Selection Algorithm,” IEEE International Conference on Multimedia and Expo, Amsterdam, 2005. DOI:10.1109/ICME.2005.1521717.

[14]       H. Farsaie Alaie, L. Abou-Abbas, C. Tadj, “Cry-based infant pathology classification using GMMs,” Speech Communication (2015), DOI: 10.1016/j.specom.2015.12.001, 2015.

[15]       Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., Weiss, B., “A database of german emotional speech”, INTERSPEECH, pp.1517–1520, 2005.

[16]       B. Yang, M. Lugger, “Emotion recognition from speech signals using new harmony features,” Signal Processing, vol. 90, no. 5, pp. 1415–1423, May 2010. DOI: 10.1016/j.sigpro.2009.09.009.

[17]       A. Rabiee, S. Setayeshi, “Robust and optimum features for Persian accent classification using artificial neural network,” in the proceedings of the 19th international conference on Neural Information Processing - Volume Part IV. DOI: 10.1007/978-3-642-34478-7_54.

[18]       J. Kittler, “Feature set search algorithms,” Journal of Pattern Recognition and Signal Process, 1978, pp. 41–60.

[19]       R. Ashrafidoost, S. Setayeshi, “A Method for Modelling and Simulation the Changes Trend of Emotions in Human Speech”, in Proc. 9th European Congress on Modelling and Simulation (EUROSIM), Sep. 2016, pp. 444–450. DOI: 10.1109/EUROSIM.2016.30.

[20]       L. R. Welch, “Hidden Markov models and the Baum-Welch algorithm,” IEEE Information Theory Society Newsletter vol. 53, pp. 1, 10-13, Dec 2003.

[21]       B. Schuller, G. Rigoll, and M. Lang, “Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine - belief network architecture,” in Proc. 2004 IEEE Int. Conf. Acoustics, Audio and Signal Processing, May 2004, vol. 1, pp. 577-580.

[22]       C. M. Lee, S. Narayanan, “Towards detecting emotion in spoken dialogs,” IEEE Trans. Speech and Audio Processing, vol. 13, no. 2, pp. 293-303, 2005.

[23]       A. B. Ingale, D. S. Chaudhari, “Speech Emotion Recognition,” International Journal of Soft Computing and Engineering (IJSCE), vol. 2, no. 1, March 2012.

[24]       A. S. Utane, S. L. Nalbalwar, “Emotion Recognition through Speech Using Gaussian Mixture Model and Support Vector Machine,” International Journal of Scientific & Engineering Research, Volume 4, Issue 5, May-2013.

[25]       R. B. Lanjewar, S. Mathurkar, N. Patel, “Implementation and Comparison of Speech Emotion Recognition System using Gaussian Mixture Model (GMM) and K-Nearest Neighbour (K-NN) Techniques,” Procedia Computer Science, vol. 49, pp. 50–57, December 2015. DOI: 10.1016/j.procs.2015.04.226.

 

 
