Improving the rate of classification of large unbalanced data with deep learning algorithms
Subject Areas: Operations Research
Shokoofa Mostofi 1, Sohrab Kordrostami 2, Amirhossein Refahi Sheikhani 3, Marzieh Faridi Masouleh 4, Soheil Shokri 5
1 - Department of Mathematics and Computer Sciences, Lahijan Branch, Islamic Azad University, Lahijan, Iran
2 - Department of Mathematics and Computer Sciences, Lahijan Branch, Islamic Azad University, Lahijan, Iran
3 - Department of Mathematics and Computer Sciences, Lahijan Branch, Islamic Azad University, Lahijan, Iran
4 - Faculty of Computer and Information Technology, Ahrar University, Rasht, Iran
5 - Department of Mathematics, Lahijan Branch, Islamic Azad University, Lahijan, Iran
Keywords: convolutional network, deep learning, unbalanced data, large data, LSTM network
Abstract:
In the modern world, vast volumes of unbalanced textual information have been transferred to the digital environment, and analyzing large unbalanced data has become a necessity there. Textual data has been analyzed with machine learning techniques, intelligent data retrieval, natural language processing, and related methods, but classification accuracy remains a problem. The purpose of this paper is to provide a system that improves the accuracy of classifying large unbalanced data. To this end, deep learning algorithms are used to process the data, generate features, and finally perform classification. The data analyzed in this study consist of large volumes of text. The method applies a set of preprocessing steps to prepare the data and then uses a model to generate embedding vectors. Two types of deep networks are used: two-dimensional convolutional networks and LSTM networks. The results, evaluated on accuracy criteria, show that on the textual data set the proposed two-dimensional convolutional networks outperform the recurrent networks on both criteria. The effect of the normalization layers and of the embedding vectors is also studied; these layers prove important enough that in some cases they increase classification accuracy by up to 15%. Finally, the final model is examined: a two-stream model that integrates the features of the two-dimensional convolutional and recurrent networks. This integration is observed to improve the accuracy of the model by up to 2.5%.
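The two-stream idea described in the abstract can be sketched roughly as follows. This is a minimal, untrained illustration in plain Python, not the authors' implementation. The abstract does not give layer sizes, the vocabulary, where normalization is applied, or how the two streams are fused. Everything below is therefore an assumption made for illustration: the dimensions, the toy vocabulary, the random weights, the normalization of each embedding vector, and fusion by concatenation.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def rand_matrix(rows, cols, scale=0.1):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(v):
    top = max(v)
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def conv_stream(embedded, filters, height):
    """2D convolution over the (tokens x embedding) matrix. Each kernel spans
    `height` tokens and the full embedding width; ReLU and global max pooling
    reduce every filter to a single feature."""
    features = []
    for kernel in filters:  # kernel: flat list of height * embed_dim weights
        best = 0.0  # starting at 0 applies the ReLU floor
        for start in range(len(embedded) - height + 1):
            window = [x for row in embedded[start:start + height] for x in row]
            best = max(best, sum(w * x for w, x in zip(kernel, window)))
        features.append(best)
    return features

def lstm_stream(embedded, w_x, w_h, bias, hidden):
    """Standard LSTM cell applied token by token; returns the last hidden state."""
    h = [0.0] * hidden
    c = [0.0] * hidden
    for x in embedded:
        z = [a + b + k for a, b, k in zip(matvec(w_x, x), matvec(w_h, h), bias)]
        i = [sigmoid(v) for v in z[0:hidden]]                # input gate
        f = [sigmoid(v) for v in z[hidden:2 * hidden]]       # forget gate
        o = [sigmoid(v) for v in z[2 * hidden:3 * hidden]]   # output gate
        g = [math.tanh(v) for v in z[3 * hidden:]]           # candidate cell state
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h

# Hypothetical sizes and vocabulary, chosen only to keep the sketch small.
EMBED_DIM, HIDDEN, N_FILTERS, HEIGHT, N_CLASSES = 8, 6, 4, 3, 2
vocab = {"<unk>": 0, "deep": 1, "learning": 2, "text": 3,
         "classification": 4, "unbalanced": 5, "data": 6}
embedding = rand_matrix(len(vocab), EMBED_DIM)
filters = rand_matrix(N_FILTERS, HEIGHT * EMBED_DIM)
w_x = rand_matrix(4 * HIDDEN, EMBED_DIM)
w_h = rand_matrix(4 * HIDDEN, HIDDEN)
bias = [0.0] * (4 * HIDDEN)
w_out = rand_matrix(N_CLASSES, N_FILTERS + HIDDEN)

def classify(text):
    tokens = text.lower().split()  # stand-in for the preprocessing steps
    embedded = [layer_norm(embedding[vocab.get(t, 0)]) for t in tokens]
    # Assumed fusion: concatenate the features of both streams.
    fused = (conv_stream(embedded, filters, HEIGHT)
             + lstm_stream(embedded, w_x, w_h, bias, HIDDEN))
    return fused, softmax(matvec(w_out, fused))

fused, probs = classify("Deep learning text classification unbalanced data")
print(len(fused), [round(p, 3) for p in probs])
```

Because the weights are random, the printed probabilities carry no meaning. The sketch only shows how features from a convolutional stream and a recurrent stream over the same embedding vectors can be combined before a shared classification layer.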