Bionic Wavelet Transform Entropy in Speaker-Independent and Context-Independent Emotional State Detection from Speech Signal
Mina Kadkhodaei Elyaderani1,2, Seyed Hamid Mahmoodian*1,2
1 Department of Electrical Engineering, Najafabad Branch, Islamic Azad University, Najafabad, Iran
2 Digital Processing and Machine Vision Research Center, Najafabad Branch, Islamic Azad University, Najafabad, Iran
Abstract
The most common way of communication between humans is through speech signals, which also carry the speaker's emotional state. In this study, bionic wavelet transform entropy is considered for speaker-independent and context-independent emotion detection from speech. After preprocessing, bionic wavelet transform decomposition with a Morlet mother wavelet is applied, and the Shannon entropy of its nodes is calculated as a feature. In addition, prosodic features such as the first four formants, jitter (pitch deviation amplitude), and shimmer (energy variation amplitude), together with MFCC features, are used to complete the feature vector. A support vector machine (SVM) is used to classify multi-class samples of emotions. Forty-six different utterances of a single sentence from the Berlin emotional speech dataset are selected for analysis. The emotions considered are sadness, happiness, fear, boredom, anger, and the normal emotional state. Experimental results show that the proposed features can improve emotional state detection accuracy in the multi-class situation.
Keywords: Bionic wavelet transform entropy, Feature selection, Speech emotion recognition, Support vector machine
Article history: Received 20-Apr-2021; Revised 01-May-2021; Accepted 15-May-2021.
© 2021 IAUCTB-IJSEE Science. All rights reserved
1. Introduction
Speech signals are the most common way of communication between humans [1,2]. Any utterance also contains the emotional state of the person, so in addition to the literal meaning of the speech, speech emotion recognition provides extra information for the listener [3,4]. Recognizing and classifying emotion from speech plays an important role in communication between humans and computers. With the rapid development of computers and emotion monitoring, and due to recent improvements in recording, storing, and processing multimedia data, the need for such systems increases every day [5,6]. These systems can be used in applied computer programs, diagnostic tools for therapists, auto-answering centers, virtual training, customer-oriented systems, computer games, mobile communications, driver emotion reporting and communication with an operator, dialogue systems, communication aids for disabled people, etc. However, despite extensive research, many problems remain. Human emotion is a complex, compound, and ambiguous phenomenon, and in most conversations between people, full, pure, and basic emotions are not expressed directly [7,8].
The expression of emotions in speech depends on the culture and language, the speech content, the gender and age of the speaker, and many other factors [9]. Therefore, emotion detection methods differ between the speaker-independent case, where an identical sentence is uttered by different speakers, and the context-independent case, where different sentences are uttered by the same speaker. In these systems, information should be extracted from the speech signal that has maximum correlation with emotion while being independent of other factors such as the speech content and the speaker. All of these issues complicate emotion detection from speech.
Generally, speech emotion recognition systems consist of two stages: feature extraction and classification. The extraction stage is essential, and selecting unsuitable features severely decreases the performance of the classifier. The most widely used features in this field are the Mel-frequency cepstral coefficients (MFCC) and their derivatives, linear predictive coefficients (LPC), formants, fundamental frequency, amplitude, jitter (fundamental frequency variation), shimmer, zero-crossing rate, perceptual linear prediction (PLP), etc. [10,11].
The wavelet transform has been introduced as a tool for non-stationary analysis, and its use in speech processing has increased recently [12,13]. The wavelet transform decomposes a function in terms of a mother wavelet [14,15]. In previous works, wavelet-based features were shown to be effective for emotion recognition from speech [16,17].
The bionic wavelet transform has been introduced in recent years as a new time-frequency method based on an auditory model [18,19]. It is built on the standard wavelet transform, except that it incorporates an active control mechanism derived from the human auditory model: it adjusts the wavelet transform according to the analysed signal and thereby improves the resolution. Although this property is useful in speech processing, it has not previously been used in the field of emotion recognition. The goal of this study is to apply this adaptive tool to emotion recognition.
Classification is an important stage that follows feature extraction. In previous papers, classifiers such as the Gaussian mixture model (GMM) [20], hidden Markov model (HMM) [21], neural network (NN) [22], and support vector machine (SVM) [23] have been used.
In [24], a noise-robust feature extraction (NRFE) procedure was presented to compute wavelet packet parameters. In that study, joint wavelet packet decomposition and autoregressive modelling were applied, and the final results showed improvements of 44.7% and 48.2%, respectively, relative to the MFCC front-end.
In another study [25], a mixed-signal processing algorithm was proposed to reduce the cost of the analog-to-digital converter as well as the computational complexity of the digital back-end. Energy consumption was reduced to 0.72 μJ and the processing time to 45.79 μs per frame.
Adaptive sparse NMF (SNMF) feature extraction and a soft mask have been used to optimize a DNN for speech enhancement [26]. The simplicity of SNMF and its ability to capture the salient structure of speech were exploited, which made the approach suitable for speech with different SNRs.
In some cases, a combination of different methods has been used. SVM is a supervised learning method used for classification and regression. Due to its suitable efficiency in emotion detection systems, its use in two-class and multi-class classification has increased in recent years. SVM is based on linear classification of data: a separating hyperplane is sought between the classes. However, to classify data with high complexity, the data should be mapped to a higher-dimensional space by an appropriate kernel. The kernels and their parameters are very important in SVM training; therefore, in order to improve classification accuracy, the kernel should be chosen properly [27,28].
In this paper, a support vector machine (SVM) is used as the classifier. In addition to the commonly used features, bionic wavelet packet decomposition is proposed for feature extraction; for this purpose, the entropy in the wavelet tree nodes is used as an auxiliary feature. The performance of each feature vector was studied in two-class classification of a given emotion (sadness, anger, happiness, or boredom) against the normal state, and a multi-class classification over all emotional states was also tested. All tests were performed on the Berlin emotional speech database in both speaker-independent and context-independent settings. The paper is structured as follows: Section 2 describes the bionic wavelet transform and the features used in our study, Section 3 the database, Section 4 the classification, Section 5 the experiments, Section 6 the results, and Section 7 concludes the paper.
2. Bionic wavelet transform
Based on an active auditory model, Jun Yao proposed a new time-frequency method named the bionic wavelet transform (BWT). BWT is distinguished from the standard wavelet transform (WT) in that the time-frequency resolution achieved by BWT can be adaptively adjusted not only by the signal frequency changes but also by the signal's instantaneous amplitude and its first-order differential [29,30]. It is this adaptability of the mother wavelet that gives the bionic wavelet its name. In BWT, the active control mechanism of the human auditory system is modelled by an adaptive mother wavelet. In general, the idea of BWT is that the envelope of the mother wavelet changes with time according to the characteristics of the input signal. Equation (1) shows the common mother wavelet used in the continuous wavelet transform:
$$\psi(t) = \tilde{f}(t)\, e^{\,j 2\pi f_0 t} \qquad (1)$$

where $\tilde{f}(t)$ is the envelope of $\psi(t)$. To calculate the wavelet transform of the signal $x(t)$, the following equation is used:

$$WT_x(a,\tau) = \frac{1}{\sqrt{a}} \int x(t)\, \psi^{*}\!\left(\frac{t-\tau}{a}\right) dt \qquad (2)$$

where $a$ is the scale and $\tau$ is the time shift.

In order to simulate the active control function of the hair cells of the auditory system, a new parameter $T$ is introduced into the WT mother function, resulting in the BWT mother function. $T$ is related to the signal's instantaneous amplitude and its first-order differential. Because of this new parameter, the envelope of the mother function in BWT can adaptively adjust according to the properties of the target signal and the other parameter settings. The BWT mother function and the BWT of the analysed signal $x(t)$ are represented as:

$$\psi_T(t) = \tilde{f}\!\left(\frac{t}{T}\right) e^{\,j 2\pi f_0 t} \qquad (3)$$

$$BWT_x(a,\tau) = \frac{1}{\sqrt{a}} \int x(t)\, \psi_T^{*}\!\left(\frac{t-\tau}{a}\right) dt \qquad (4)$$
It is found that a linear relationship exists between BWT and WT when the Morlet mother wavelet is used. Therefore, to realize a fast implementation of the continuous bionic wavelet transform, the following relation can be used:

$$BWT_x(a,\tau) = K \cdot WT_x(a,\tau) \qquad (5)$$

where $K$ is a multiplying factor:

$$K = K(T, T_0) \qquad (6)$$

which depends only on the adaptation parameter $T$ and the initial envelope constant $T_0$ of the Morlet mother wavelet; its closed-form expression is given in [29,30].
BWT has been applied in speech signal processing as a rapid and reliable method.
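For illustration, the following minimal Python sketch evaluates the BWT definition of Eq. (4) directly for a single scale and time shift, assuming a Gaussian envelope $\tilde{f}(t)=e^{-(t/T_0)^2}$ and a fixed adaptation factor $T$; in a full implementation $T$ would be updated adaptively from the signal's instantaneous amplitude as in [29,30], and the default values of `f0` and `T0` are illustrative choices, not values taken from this paper.

```python
import numpy as np

def bwt_coefficient(x, fs, a, tau, T, f0=15165.4, T0=0.0005):
    """Direct evaluation of Eq. (4) for one scale `a` and shift `tau`.

    x      : 1-D speech signal (numpy array)
    fs     : sampling rate in Hz
    a      : dimensionless scale of the analysed band
    tau    : time shift in seconds
    T      : adaptation factor controlling the envelope width (T = 1 gives
             the ordinary Morlet WT coefficient)
    f0, T0 : centre frequency and envelope constant of the Morlet mother
             wavelet (illustrative defaults, not taken from the paper)
    """
    t = np.arange(len(x)) / fs
    # T-adapted Morlet atom of Eq. (3), scaled and shifted as in Eq. (4):
    # the envelope is widened by T while the carrier frequency is unchanged.
    arg = (t - tau) / a
    envelope = np.exp(-((arg / (T * T0)) ** 2))
    carrier = np.exp(1j * 2 * np.pi * f0 * arg)
    atom = envelope * carrier
    # inner product <x, psi_T((t - tau)/a)> / sqrt(a), Riemann sum with dt = 1/fs
    return np.sum(x * np.conj(atom)) / (np.sqrt(a) * fs)
```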
A. Feature selection
The speech features extracted from the speech signal contain valuable information [31]. Commonly used features in speech recognition studies include formants, shimmer, jitter, linear predictive coefficients (LPC), linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), the first derivative of MFCC (D-MFCC), and the second derivative of MFCC (DD-MFCC). Log frequency power coefficients (LFPC), perceptual linear prediction (PLP), RelAtive SpecTrAl PLP (RASTA-PLP), log energy, and the zero-crossing rate (ZCR) have also been used.
The following features were extracted to train our system because our preliminary studies showed that these features gave better results.
B. Shimmer
Shimmer is a repeating variation in the amplitude of the voice; it represents the relative period-to-period change of the peak-to-peak amplitude [32,33]:

$$\text{Shimmer} = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1} \left| A_{i+1}-A_i \right|}{\frac{1}{N}\sum_{i=1}^{N} A_i} \qquad (7)$$

where $A_i$ is the peak-to-peak amplitude of the $i$-th pitch period and $N$ is the number of periods.
C. Jitter
Jitter is defined as the varying pitch in the voice, which reflects the roughness of the sound. It is the unwanted deviation from true periodicity of a supposedly periodic signal and represents the relative period-to-period variability [34,35]:

$$\text{Jitter} = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1} \left| T_{i+1}-T_i \right|}{\frac{1}{N}\sum_{i=1}^{N} T_i} \qquad (8)$$

where $T_i$ is the duration of the $i$-th pitch period.
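As a sketch of Eqs. (7) and (8), the hypothetical helpers below compute relative jitter and shimmer from per-period measurements (pitch-period durations and peak-to-peak amplitudes); how those per-period values are obtained, e.g. by pitch marking, is outside the scope of the snippet, and the example numbers are made up.

```python
import numpy as np

def relative_jitter(periods):
    """Eq. (8): mean absolute period-to-period difference over the mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def relative_shimmer(amplitudes):
    """Eq. (7): mean absolute amplitude difference over the mean amplitude."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Example with made-up per-period measurements (seconds, arbitrary amplitude units)
periods = [0.0102, 0.0100, 0.0105, 0.0101, 0.0103]
amps = [0.82, 0.79, 0.85, 0.80, 0.83]
print(relative_jitter(periods), relative_shimmer(amps))
```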
D. MFCC (Mel-frequency cepstral coefficients)
The Mel-frequency cepstral coefficients form a parametric representation that is widely used in the field of speech emotion recognition. MFCC is designed on the basis of the human ear's hearing system and uses nonlinear frequency units to simulate the human auditory system [36]. The following steps are carried out to calculate the MFCC features. First, the fast Fourier transform (FFT) of the signal is computed [37]. After taking the FFT, the power coefficients are calculated using triangular band-pass filter banks, also known as Mel-scale filters. The mapping from linear frequency to Mel frequency is:

$$\text{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) \qquad (9)$$

Finally, the log Mel spectrum is converted back to the time domain by the discrete cosine transform (DCT). The number of filter banks used was 26, but only the lower 13 MFCC coefficients are used for each sound sample.
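For reference, a minimal sketch of this pipeline using the third-party librosa library is shown below; the file name is hypothetical, and the frame settings mirror the experimental setup described later (40 ms Hamming frames with 50% overlap, 26 Mel filters, 13 coefficients kept).

```python
import librosa

# Load one EMO-DB style utterance (16 kHz mono); the path is hypothetical.
y, sr = librosa.load("emodb_utterance.wav", sr=16000)

# 40 ms frames with 50% overlap, 26 Mel filters, lower 13 coefficients kept.
frame_len = int(0.040 * sr)          # 640 samples
hop_len = frame_len // 2             # 50% overlap
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=26,
                            n_fft=frame_len, hop_length=hop_len,
                            window="hamming")
print(mfcc.shape)                    # (13, number_of_frames)
```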
3. Data
The Berlin Emotional Speech Database (EMO-DB), a popular studio-recorded corpus, was used in our study [38]. This database covers different emotions, including anger, disgust, fear, joy, and sadness, with surprise exchanged in favour of boredom and a neutral state added. Forty-six different utterances of a single sentence from the Berlin dataset, produced by 10 speakers in sadness, happiness, fear, boredom, anger, and the normal emotional state, were selected as training and test data. The database was recorded at 16 bits and 16 kHz under noise-free studio conditions.
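As an illustration of how such a subset can be assembled, the sketch below groups EMO-DB files for one sentence by emotion using the dataset's file-naming convention, in which, to the best of our knowledge, the sixth character of the file name encodes the emotion (W = anger, L = boredom, A = fear, F = happiness, T = sadness, N = neutral); the directory path and sentence code are hypothetical and should be checked against the database documentation.

```python
from pathlib import Path
from collections import defaultdict

# EMO-DB file names look like "03a01Wa.wav": speaker (2 chars), text code
# (3 chars), emotion letter, version letter. This mapping is the commonly
# documented one; verify it against the official database description.
EMOTION_CODES = {"W": "anger", "L": "boredom", "A": "fear",
                 "F": "happiness", "T": "sadness", "N": "neutral"}

def group_by_emotion(wav_dir="emodb/wav", sentence_code="a01"):
    """Collect utterances of one sentence, grouped by emotional state."""
    groups = defaultdict(list)
    for wav in Path(wav_dir).glob("*.wav"):
        name = wav.stem                      # e.g. "03a01Wa"
        if name[2:5] != sentence_code:       # keep a single sentence only
            continue
        emotion = EMOTION_CODES.get(name[5])
        if emotion is not None:              # skip disgust and unknown codes
            groups[emotion].append(wav)
    return groups
```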
4. Classification
SVM has been used in many speech recognition applications. It is a supervised learning algorithm used for classification and regression, and in recent years this comparatively new method has been shown to outperform older classifiers such as neural networks. It has been used to recognize the speech emotional state with very good performance. In this work, the classifier was trained using a Gaussian radial basis function (RBF) kernel and tested using a 2-fold cross-validation strategy.
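A minimal scikit-learn sketch of this classification setup (an RBF-kernel SVM evaluated with 2-fold cross-validation) is given below; `X` and `y` stand for the assembled feature vectors and emotion labels, random placeholders are used so the sketch runs stand-alone, and the hyperparameter values are illustrative rather than those used in the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# X: (n_utterances, n_features) matrix of feature vectors, y: emotion labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))            # placeholder feature matrix
y = rng.integers(0, 6, size=60)          # six emotional states

clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(clf, X, y, cv=2)    # 2-fold cross-validation
print("mean recognition rate:", scores.mean())
```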
5. Experiments
Fig. 1 shows the structure of the proposed algorithm using bionic entropy. First, to extract stationary features, the speech signal, which is non-stationary, should be divided into frames of 20-100 ms. In this study, the window length was chosen as 40 milliseconds with a 50% overlap; this overlap is necessary to smooth changes in the features from one frame to the next. To minimize the impact of the frame edges on the spectrum, a Hamming window was used, which emphasizes the information in the middle of the frame. Then, using the bionic wavelet filter bank, which is composed of 22 bands, each frame of speech is analysed in 22 bands, and the Shannon entropy of each band is calculated by the following equation:
$$e(i) = -\sum_{j} s_j \log_2 s_j \qquad (10)$$

where $e(i)$ is the entropy of the $i$-th band and $s_j$ is the normalized value of the $j$-th bin of the histogram of the BWT coefficients of that band. The feature vector is then formed by combining the bionic entropy with the Mel cepstral coefficients, linear prediction coefficients, formants, jitter, and shimmer, and finally the support vector machine is used for classification.
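A minimal sketch of this per-band entropy feature, assuming the 22 per-band coefficient arrays have already been produced by the bionic wavelet decomposition (for instance with a routine such as the `bwt_coefficient` helper above applied over a grid of time shifts), could look as follows; the histogram bin count is an illustrative choice.

```python
import numpy as np

def band_entropy(coeffs, n_bins=32):
    """Shannon entropy (Eq. 10) of the histogram of one band's BWT coefficients.

    coeffs : 1-D array of (complex) BWT coefficients for a single band
    n_bins : number of histogram bins (illustrative choice)
    """
    hist, _ = np.histogram(np.abs(coeffs), bins=n_bins)
    s = hist / hist.sum()            # normalize bin counts to probabilities
    s = s[s > 0]                     # drop empty bins to avoid log(0)
    return -np.sum(s * np.log2(s))

def frame_entropy_features(band_coeffs):
    """One 22-dimensional entropy vector per frame, one entry per band."""
    return np.array([band_entropy(c) for c in band_coeffs])
```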
6. Results
Tables 1 and 2 show the recognition rates of the different feature sets for two-class and multi-class classification in the context-independent and speaker-independent tests, respectively.
Table 1. Context-independent test: recognition rate (%) using common and combinatory features based on the wavelet transform
Database: Berlin Database
Feature | Anger-normal | Boredom-normal | Fear-normal | Happy-normal | Sad-normal | Multi-class
MFCC | 91.42 | 65.71 | 87.14 | 91.42 | 81.42 | 70.00
MFCC + Formants | 80.00 | 93.67 | 95.00 | 75.00 | 94.89 | 68.33
MFCC + Jitter + Shimmer | 85.00 | 91.42 | 85.00 | 80.00 | 90.00 | 65.00
MFCC + ZCR | 85.00 | 90.00 | 85.00 | 75.00 | 91.42 | 71.22
MFCC + Formants + Jitter + Shimmer | 90.00 | 62.85 | 85.71 | 91.42 | 80.00 | 70.00
BWT Entropy | 74.28 | 68.57 | 64.28 | 71.42 | 68.57 | 62.00
BWT Entropy + MFCC | 91.42 | 75.71 | 90.00 | 85.71 | 88.57 | 75.00
BWT Entropy + Formants | 74.28 | 51.42 | 70.00 | 70.00 | 70.00 | 58.33
BWT Entropy + Jitter + Shimmer | 65.71 | 55.71 | 65.00 | 75.00 | 65.00 | 55.00
BWT Entropy + Formants + Jitter + Shimmer | 70.00 | 65.71 | 61.42 | 68.57 | 54.28 | 62.00
BWT Entropy + MFCC + Formants + Jitter + Shimmer | 95.00 | 75.71 | 85.71 | 92.85 | 88.57 | 71.66
Table 2. Speaker-independent test: recognition rate (%) using common and combinatory features based on the wavelet transform
Database: Berlin Database
Feature | Anger | Boredom | Fear | Happy | Sad | Multi-class
MFCC | 81.42 | 70.00 | 76.14 | 71.42 | 75.71 | 66.66
MFCC + Formants | 85.00 | 80.00 | 80.00 | 90.00 | 80.00 | 58.33
MFCC + Jitter + Shimmer | 85.00 | 78.00 | 75.00 | 75.00 | 80.00 | 65.00
MFCC + ZCR | 80.00 | 60.00 | 85.00 | 80.00 | 85.00 | 65.00
MFCC + Formants + Jitter + Shimmer | 77.14 | 70.00 | 64.28 | 68.57 | 78.57 | 60.00
BWT Entropy | 84.28 | 65.71 | 67.14 | 61.42 | 75.71 | 66.66
BWT Entropy + MFCC | 85.71 | 75.00 | 62.85 | 70.00 | 75.00 | 71.66
BWT Entropy + Formants | 80.00 | 65.71 | 52.85 | 64.28 | 74.28 | 72.00
BWT Entropy + Jitter + Shimmer | 90.00 | 65.71 | 57.14 | 58.57 | 68.57 | 62.00
BWT Entropy + Formants + Jitter + Shimmer | 81.42 | 65.71 | 54.28 | 72.85 | 72.85 | 65.00
BWT Entropy + MFCC + Formants + Jitter + Shimmer | 84.28 | 74.28 | 61.42 | 72.85 | 78.21 | 65.00
As seen in Tables 1 and 2, the commonly used features gave the best recognition rates for the two-class classifications, while the features based on BWT were the most discriminative for the multi-class classification.
7. Conclusion and discussion
To evaluate our results, previous studies that used the Berlin database were reviewed. Gaurav [39] used a combination of features such as the Mel cepstrum, energy, jitter, shimmer, and zero-crossing rate and performed multi-class classification with SVM and GMM. The reported detection rate was 65%, which is lower than the rate obtained here using bionic wavelet entropy.
Yang et al. [40] employed features such as energy, formants, and the zero-crossing rate with GMM classification and reported a detection rate of 52.7%. The authors of [41] used prosodic features and a support vector machine classifier, which led to a detection rate of 67.7%. The researchers in [42] used spectral features and various classifiers and reported a multi-class detection rate of 66.83%. Hübner et al. [43] used the zero-crossing rate and performed classification with an MLP neural network; a detection rate of 61.42% was obtained for the state of happiness. Hassan et al. [44] used spectral features and performed multi-class classification with SVM, obtaining a detection rate of 63.2%. All of these studies report lower recognition rates than our results. In our study, we identified the most discriminative features for the speech emotion recognition system and investigated the effectiveness of each feature component. The results also show that BWT entropy achieves a good recognition rate with a multi-class SVM classifier under speaker-independent and context-independent conditions.
References
[1]. R. M. Simply, E. Dafna, Y. Zigel, "Diagnosis of obstructive sleep apnea using speech signals from awake subjects", IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 2, pp. 251-260, Feb. 2020.
[2]. Y. Pourebrahim, F. Razzazi, H. Sameti, "Speech emotion recognition using a combination of transformer and convolutional neural networks", Journal of Intelligent Procedures in Electrical Technology, vol. 13, no. 52, pp. 79-98, March 2023.
[3]. S. Chakrabartty, Y. Deng, G. Cauwenberghs, "Robust speech feature extraction by growth transformation in reproducing kernel hilbert space", IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 6, pp. 1842-1849, Aug. 2007.
[4]. F. Faghani, H. Abutalebi, “Implementation of hybrid speech dereverberation systems and proposing dual microphone farsi database in order to evaluating enhancement systems”, Journal of Intelligent Procedures in Electrical Technology, vol. 3, no. 12, pp. 35-46, 2013.
[5]. M.B. Er, E. Isik, I. Isik, “Parkinson’s detection based on combined CNN and LSTM using enhanced speech signals with Variational mode decomposition”, Biomedical Signal Processing and Control, vol. 70, Article Number: 103006, Sept. 2021.
[6]. K. Gurugubelli, A.K. Vuppala, "Stable implementation of zero frequency filtering of speech signals for efficient epoch extraction", IEEE Signal Processing Letters, vol. 26, no. 9, pp. 1310-1314, Sept. 2019.
[7]. D.J. France, R.G. Shiavi, S. Silverman, M. Silverman, M. Wilkes, "Acoustical properties of speech as indicators of depression and suicidal risk", IEEE Trans. on Biomedical Engineering, vol. 47, no. 7, pp. 829-837, July 2000.
[8]. N. Cummins, S. Scherer, J. Krajewski, S. Schnieder, J. Epps, T.F. Quatieri, “A review of depression and suicide risk assessment using speech analysis”, Speech Communication, vol. 71, pp. 10-49, July 2015,
[9]. S. Yildirim, Y. Kaya, F. Kılıç, “A modified feature selection method based on metaheuristic algorithms for speech emotion recognition”, Applied Acoustics, vol. 173, Article Number: 107721, Feb. 2021.
[10]. K. Yang, X. Zhou, "Unsupervised classification of hydrophone signals with an improved mel-frequency cepstral coefficient based on measured data analysis", IEEE Access, vol. 7, pp. 124937-124947, 2019.
[11]. R. Hassen, B. Gülecyüz, E. Steinbach, "PVC-SLP: perceptual vibrotactile-signal compression based-on sparse linear prediction", IEEE Trans. on Multimedia, vol. 23, pp. 4455-4468, 2021.
[12]. M. Rezayatmand, A. Naghsh, "A new robust and semi-blind digital image watermarking method based on DWT and SVD", Journal of Intelligent Procedures in Electrical Technology, vol. 13, no. 51, pp. 1-18, December 2022.
[13]. M. Karimi, H. Pourghassem, G. Shahgholian, "A novel prosthetic hand control approach based on genetic algorithm and wavelet transform features", Proceeding of the IEEE/CSPA, pp. 287-292, Penang, Malaysia, March 2011.
[14]. S.B. Emami, N. Nourafza, S. Fekri–Ershad, "A method for diagnosing of Alzheimer's disease using the brain emotional learning algorithm and wavelet feature", Journal of Intelligent Procedures in Electrical Technology, vol. 13, no. 52, pp. 65-78, March 2023.
[15]. M. Iyzadpanahi, M.R. Yousefi, N. Behzadfar, “Classification of upper limb movement imaginations based on a hybrid method of wavelet transform and principal component analysis for brain-computer interface applications”, Journal of Novel Researches on Electrical Power, vol. 9, no. 3, pp. 35-42, 2020.
[16]. M. Kadkhodaei Elyaderani, S.H. Mahmoodian, G. Sheikhi, "Wavelet packet entropy in speaker-independent emotional state detection from speech signal", Journal of Intelligent Procedures in Electrical Technology, vol. 5, no. 20, pp. 67-74, March 2015.
[17]. G. Sheikhi, H. Mahmoodian, “Syllable segmentation of farsi continuous speech using wavelet coefficients thresholding and fuzzy smoothing of energy contour”, Journal of Intelligent Procedures in Electrical Technology, vol. 4, no. 15, pp. 19-30, Dec. 2013.
[18]. A. Garg, O.P. Shau, "A hybrid approach for speech enhancement using bionic wavelet transform and butterworth filter", International Journal of Computers and Applications, vol. 42, no. 7, 2020.
[19]. F. Chen, Y.T. Zhang, “A new implementation of discrete bionic wavelet transform: Adaptive tiling”, Digital Signal Processing, vol. 16, no. 3, pp. 233-246, May 2006.
[20]. A.D. Dileep, C.C. Sekhar, "GMM-based intermediate matching kernel for classification of varying length patterns of long duration speech using support vector machines", IEEE Trans. on Neural Networks and Learning Systems, vol. 25, no. 8, pp. 1421-1432, Aug. 2014.
[21]. A.D. Dileep, C.C. Sekhar, "HMM-based intermediate matching kernel for classification of sequential patterns of speech using support vector machines", IEEE Trans. on Audio, Speech, and Language Processing, vol. 21, no. 12, pp. 2570-2582, Dec. 2013.
[22]. N.Y.H. Wang et al., "Improving the intelligibility of speech for simulated electric and acoustic stimulation using fully convolutional neural networks", IEEE Trans. on Neural Systems and Rehabilitation Engineering, vol. 29, pp. 184-195, 2021.
[23]. G. Chau, G. Kemper, "One channel subvocal speech phrases recognition using cumulative residual entropy and support vector machines", IEEE Latin America Transactions, vol. 13, no. 7, pp. 2135-2143, July 2015.
[24]. B. Kotnik, Z. Kačič, “A noise robust feature extraction algorithm using joint wavelet packet subband decomposition and AR modeling of speech signals”, Signal Processing, vol. 87, no. 6, pp. 1202-1223, June 2007.
[25]. Q. Li et al., "MSP-MFCC: Energy-efficient MFCC feature extraction method with mixed-signal processing architecture for wearable speech recognition applications", IEEE Access, vol. 8, pp. 48720-48730, 2020.
[26]. H. Jia, W. Wang, S. Mei, “Combining adaptive sparse NMF feature extraction and soft mask to optimize DNN for speech enhancement”, Applied Acoustics, vol. 171, Article Number: 107666, Jan. 2021.
[27]. S. Chandaka, A. Chatterjee, S. Munshi, “Support vector machines employing cross-correlation for emotional speech recognition”, Measurement, vol. 42, no. 4, pp. 611-618, May 2009.
[28]. M. Matsumoto, J. Hori, “Classification of silent speech using support vector machine and relevance vector machine”, Applied Soft Computing, vol. 20, pp. 95-102, July 2014.
[29]. J. Yao, Y.T. Zhang, "The application of bionic wavelet transform to speech signal processing in cochlear implants using neural network simulations", IEEE Trans. on Biomedical Engineering, vol. 49, no. 11, pp. 1299-1309, Nov. 2002.
[30]. J. Yao, Y.T. Zhang, "Bionic wavelet transform: a new time-frequency method based on an auditory model", IEEE Trans. on Biomedical Engineering, vol. 48, no. 8, pp. 856-863, Aug. 2001.
[31]. Y. Hu, Z. Ling, "Extracting spectral features using deep autoencoders with binary distributed hidden units for statistical parametric speech synthesis", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 4, pp. 713-724, April 2018.
[32]. Y. Bennane, A. Kacha, J. Schoentgen, F. Grenez, "Synthesis of pathological voices and experiments on the effect of jitter and shimmer in voice quality perception", Proceeding of the IEEE/ICEE-B, pp. 1-6, Boumerdes, Algeria, Oct. 2017.
[33]. S.S. Upadhya, A.N. Cheeran, J.H. Nirmal, "Statistical comparison of Jitter and Shimmer voice features for healthy and Parkinson affected persons", Proceeding of the IEEE/ICECCT, pp. 1-6, Coimbatore, India, Feb. 2017.
[34]. J.M. Miramont, M.A. Colominas, G. Schlotthauer, "Voice Jitter estimation using high-order synchrosqueezing operators", IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 29, pp. 527-536, 2021.
[35]. J. Suarez, M. Salcedo, C. Carmona, J. Ramirez, G. Serna, "Effects of IPv6-IPv4 tunnel in Jitter of voice over IPv6, measured in laboratory and over the National Research and Education Network of Colombia “RENATA”", IEEE Latin America Transactions, vol. 14, no. 3, pp. 1380-1386, March 2016.
[36]. Y. Shi, J. Bai, P. Xue, D. Shi, "Fusion feature extraction based on auditory and energy for noise-robust speech recognition", IEEE Access, vol. 7, pp. 81911-81922, 2019.
[37]. M. Manoochehri, H. Pourghassem, G. Shahgholian, "A novel synthetic image watermarking algorithm based on Discrete Wavelet Transform and Fourier-mellin transform", Proceeding of the IEEE/ICCSN, pp. 265-269, Xi'an, China, May 2011.
[38]. F. Burkhardt et al., "A database of German emotional speech", Proceedings of Interspeech, Lisbon, Portugal, pp. 1517-1520, 2005.
[39]. M. Gaurav, "Performance analysis of spectral and prosodic features and their fusion for emotion recognition in speech", Proceeding of the IEEE/SLT, pp. 313-316, Goa, India, Dec. 2008.
[40]. B. Yang, M. Lugger, “Emotion recognition from speech signals using new harmony features”, Signal Processing, vol. 90, no. 5, pp. 1415-1423, May 2010.
[41]. D. Bitouk, R. Verma, A. Nenkova, “Class-level spectral features for emotion recognition”, Speech Communication, vol. 52, no. 7–8, pp. 613-625, July/Aug. 2010.
[42]. E.M. Albornoz, D.H. Milone, H.L. Rufiner, “Spoken emotion recognition using hierarchical classifiers”, Computer Speech & Language, vol. 25, no. 3, pp. 556-570, July 2011.
[43]. D. Philippou-Hübner, B. Vlasenko, R. Böck, A. Wendemuth, "The performance of the speaking rate parameter in emotion recognition from speech", Proceeding of the IEEE/ICME, pp. 248-253, Melbourne, VIC, Australia, July 2012.
[44]. A. Hassan, R.I. Damper, “Classification of emotional speech using 3DEC hierarchical classifier”, Speech Communication, vol. 54, no. 7, pp. 903-916, Sept. 2012.