MSDSA: Imbalanced Data Sentiment Analysis using Manifold Smoothness Satisfied Data
Topics: Journal of Computer & Robotics
Shima Rashidi 1, Jarar Tanha 2, Arash Sharifi 3, Mehdi HoseinZadeh 4
1 - Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran.
2 - University of Human Development, Sulaymaniyah, Kurdistan Region of Iraq.
3 - Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran.
4 - Pattern Recognition and Machine Learning Lab, Gachon University, Seongnam, Republic of Korea.
Keywords: Twitter Sentiment Analysis, Manifold Smoothness, SMOTE, XGBoost, BERT
Abstract:
This paper proposes a new approach to sentiment analysis on imbalanced data. Sentiment analysis aims to identify the attitudes and preferences expressed in user reviews, and the area has recently received growing attention. The proposed method has three steps. First, we learn a discriminative representation of tweets by fine-tuning the BERT model in a supervised manner with a proposed loss function based on manifold smoothness; the goal is a representation in which each sample's local neighbors share its class label. Second, we over-sample the minority class in the new representation, using a modified SMOTE algorithm that adds a synthetic sample to the generated set only if it satisfies the manifold smoothness condition. Third, we train an XGBoost classifier on the combined original and over-sampled data as the final predictor. We evaluate the proposed model on the SemEval-2017 Task 4 dataset, and the experimental results show the effectiveness of the proposed approach.
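The second step can be illustrated with a minimal sketch. The abstract does not give the exact smoothness criterion, so the test used here is an assumption: a synthetic point is kept only when a chosen fraction of its k nearest original samples carry the minority label. The function names `knn_indices` and `smooth_smote`, the parameters `k`, `min_agreement`, and `max_tries`, and the toy 2-D data (standing in for fine-tuned BERT embeddings) are all hypothetical and not taken from the paper.

```python
import numpy as np


def knn_indices(points, query, k):
    """Return indices of the k rows of `points` closest to `query` (Euclidean)."""
    dists = np.linalg.norm(points - query, axis=1)
    return np.argsort(dists)[:k]


def smooth_smote(X, y, minority_label, n_new, k=5, min_agreement=1.0,
                 seed=0, max_tries=10000):
    """SMOTE-style over-sampling with an assumed neighborhood-agreement filter.

    Assumes the minority class has at least two samples.
    """
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    accepted = []
    tries = 0
    while len(accepted) < n_new and tries < max_tries:
        tries += 1
        # Standard SMOTE step: interpolate between a minority sample
        # and one of its nearest minority neighbors.
        i = rng.integers(len(X_min))
        neighbours = knn_indices(X_min, X_min[i], k + 1)[1:]  # drop the closest hit (the sample itself)
        j = rng.choice(neighbours)
        candidate = X_min[i] + rng.random() * (X_min[j] - X_min[i])
        # Assumed smoothness test: the candidate's k nearest ORIGINAL samples
        # must mostly share the minority label, otherwise it is discarded.
        local = knn_indices(X, candidate, k)
        if np.mean(y[local] == minority_label) >= min_agreement:
            accepted.append(candidate)
    return np.array(accepted).reshape(-1, X.shape[1])


# Toy data: two well-separated clusters standing in for the learned embeddings.
rng = np.random.default_rng(42)
X_major = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
X_minor = rng.normal(loc=10.0, scale=0.5, size=(10, 2))
X = np.vstack([X_major, X_minor])
y = np.array([0] * 50 + [1] * 10)

synthetic = smooth_smote(X, y, minority_label=1, n_new=20)
X_balanced = np.vstack([X, synthetic])
y_balanced = np.concatenate([y, np.ones(len(synthetic), dtype=int)])
```

In the pipeline described above, `X_balanced` and `y_balanced` would then be passed to the final classifier (XGBoost in the paper). Lowering `min_agreement` would accept candidates in more mixed neighborhoods, trading strictness of the filter for a larger pool of synthetic samples.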