MSDSA: Imbalanced Data Sentiment Analysis using Manifold Smoothness Satisfied Data
Subject Areas : Journal of Computer & Robotics
Shima Rashidi 1, Jafar Tanha 2 *, Arash Sharifi 3, Mehdi Hosseinzadeh 4
1 - Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran.
2 - University of Human Development, Sulaymaniyah, Kurdistan Region of Iraq.
3 - Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran.
4 - Pattern Recognition and Machine Learning Lab, Gachon University, Seongnam, Republic of Korea.
Keywords: Twitter Sentiment Analysis, Manifold Smoothness, SMOTE, XGBoost, BERT.
Abstract
This paper proposes a new approach to sentiment analysis on imbalanced data. The main goal of sentiment analysis is to understand the attitudes and preferences expressed in user reviews, and the area has recently received growing attention. The proposed method has three steps. First, we learn a discriminative representation of tweets by fine-tuning the BERT model in a supervised manner with a proposed loss function based on manifold smoothness. The goal is to find a representation in which each sample's local neighbors share its class label. Second, we over-sample the minority class in the new representation. To do this, we modify the SMOTE algorithm so that only generated samples that satisfy manifold smoothness are added to the generated sample set. Third, we train the XGBoost algorithm on the combined original and over-sampled data as the final task predictor. We evaluate the proposed model on the SemEval-2017 Task 4 dataset, and the experimental results show the effectiveness of the proposed method.
Journal of Computer & Robotics 18 (1), Winter and Spring 2025, 25-32
Shima Rashidi a,b, Jafar Tanha c,*, Arash Sharifi a, Mehdi Hosseinzadeh d
a Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran.
b University of Human Development, Sulaymaniyah, Kurdistan Region of Iraq.
c Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz, Iran.
d Pattern Recognition and Machine Learning Lab, Gachon University, Seongnam, Republic of Korea.
Received 29 February 2024, Accepted 01 May 2024
Keywords: Twitter Sentiment Analysis, Imbalanced Data, Manifold Smoothness, XGBoost, BERT.
1. Introduction
Twitter sentiment detection is an important research area whose goal is to evaluate the quality of a product or service. Many computational approaches have been introduced so far. In these approaches, a tweet is fed into a model, which then assigns it a sentiment label. Such a system can provide useful knowledge for other research areas, such as recommender systems and financial forecasting [1-3].
Recently, deep learning methods have received the most attention in sentiment analysis because deep neural networks successfully extract knowledge from raw tweets [1, 4]. However, classic machine learning approaches still perform well in task prediction. This area has many challenges, such as low-resource data and imbalanced data. For low-resource data, some approaches use augmentation techniques or generate new data. In [5], a generative adversarial network (GAN) [6] is used as a data augmentation technique. Meetei et al. [7] introduced a low-resource sentiment analysis method in which preprocessing techniques generate additional linguistic features to compensate for the limited data.
As mentioned, another challenge in this area is that the classes differ in sample size; in other words, the dataset is imbalanced. This challenge has attracted interest in recent years. Ghosh et al. [8] introduced a method for Twitter sentiment analysis on imbalanced data that uses a minority oversampling technique. Krawczyk et al. [9] introduced a model for sentiment analysis of Twitter data that addresses multi-class imbalance. They presented a framework with three steps. First, they decomposed the multiple classes into pairs of binary classes using the one-vs-one technique. Then, for each pair of classes, they reduced the dimensionality of the data using Multiple Correspondence Analysis. Next, they preprocessed each pair of classes and learned a model for it. The final model is the weighted average of the learned binary models. Ah-Pine & Soriano-Morales [10] used synthetic oversampling techniques to cope with the imbalanced dataset.
This paper focuses on imbalanced data and proposes a new method to cope with this challenge. The overall schematic of the proposed approach is given in Figure 1. As shown, the proposed method has three main steps. In the first step, we learn a discriminative feature descriptor for each tweet. To do so, we use a pre-trained language model with some additional layers on top, and we define a new loss function based on manifold smoothness to learn a more discriminative feature distribution. In the second step, we propose a manifold smoothness SMOTE algorithm, which generates new samples for the minority class. Finally, we use the augmented data to train the XGBoost algorithm as the task predictor.
To recap, the main contributions of this paper are as follows:
Proposing a new loss function, based on manifold smoothness, to train the language model.
Proposing a new version of SMOTE, called MS_SMOTE, which generates new samples based on manifold smoothness.
Proposing a framework that uses both in sentiment analysis.
This paper is organized as follows. Section 2 gives a detailed explanation of the proposed method. The experiments and results are presented in Section 3. The advantages and shortcomings of the proposed method are discussed in Section 4.
2. Proposed Method
This section presents the proposed approach, whose overall schematic is given in Figure 1. The proposed method has three subnetworks, which this section explains in detail.
Problem Formulation
Let $D = \{(x_i, y_i)\}_{i=1}^{N}$ denote the whole training set, where $x_i$ is the $i$th tweet and $y_i$ is its corresponding label. It is assumed that $y_i \in$ {positive, neutral, negative}, and $N$ is the number of training samples. The number of samples per class is imbalanced in the training set. The main goal of this paper is to design a new approach for sentiment analysis of these tweets in the presence of imbalanced data. In this paper, the imbalanced ratio is defined as the ratio of the number of positive samples to the number of remaining samples.
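The imbalanced ratio defined above can be computed directly from the label list. The following is a minimal sketch; the function name `imbalance_ratio` and the toy label counts are illustrative assumptions, not values from the paper.

```python
from collections import Counter


def imbalance_ratio(labels, minority="positive"):
    """Number of minority-class samples divided by the number of all other samples."""
    counts = Counter(labels)
    n_minority = counts[minority]
    n_rest = len(labels) - n_minority
    if n_rest == 0:
        raise ValueError("no samples outside the minority class")
    return n_minority / n_rest


# Toy data (hypothetical): 2 positive tweets against 8 neutral/negative ones.
labels = ["positive"] * 2 + ["neutral"] * 5 + ["negative"] * 3
ratio = imbalance_ratio(labels)
print(ratio)  # 2 / 8
```

A ratio well below 1 indicates that the positive class is the minority, which is the setting the rest of the method targets.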
The proposed method has three steps:
1) Learning a feature representation that ensures manifold smoothness.
2) Oversampling the minority class with the proposed combination of SMOTE and manifold smoothness.
3) Learning a boosting model based on XGBoost.
In the following, we explain each of these steps in detail.
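As a rough illustration of step 2, the sketch below combines the standard SMOTE interpolation [11] with a smoothness filter. The acceptance rule used here (a candidate is kept only if all of its k nearest original samples belong to the minority class) is an assumption made for illustration; the function names `knn` and `ms_smote` and all parameters are hypothetical and are not taken from the authors' implementation.

```python
import math
import random


def knn(point, data, k):
    """Indices of the k points in `data` closest to `point` (Euclidean distance)."""
    order = sorted(range(len(data)), key=lambda i: math.dist(point, data[i]))
    return order[:k]


def ms_smote(X, y, minority, n_new, k=3, seed=0, max_tries=1000):
    """SMOTE-style oversampling with an assumed manifold-smoothness filter."""
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    if len(minority_pts) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n_new:
            break
        base = rng.choice(minority_pts)
        # Standard SMOTE step: pick one of the base point's minority neighbours
        # and interpolate at a random position on the connecting segment.
        others = [p for p in minority_pts if p is not base]
        neighbour = others[rng.choice(knn(base, others, min(k, len(others))))]
        t = rng.random()
        candidate = [b + t * (n - b) for b, n in zip(base, neighbour)]
        # Assumed smoothness filter: every one of the candidate's k nearest
        # original samples must carry the minority label.
        if all(y[i] == minority for i in knn(candidate, X, k)):
            accepted.append(candidate)
    return accepted


# Well-separated toy clusters: every candidate passes the filter.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
     [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [1.1, 1.1], [0.9, 0.9]]
y = ["pos"] * 4 + ["neg"] * 5
new = ms_smote(X, y, minority="pos", n_new=5, k=3)

# Majority samples sit between two minority samples: candidates that land
# near the majority samples are rejected.
X2 = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]
y2 = ["pos", "pos", "neg", "neg", "neg"]
filtered = ms_smote(X2, y2, minority="pos", n_new=5, k=1)
```

In the second toy set, plain SMOTE would place synthetic minority points in the middle of the majority region; the filter discards exactly those candidates, which is the behavior the method description asks for.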
Feature Representation
In this step, we embed each tweet into a discriminative feature vector. Word embedding is a crucial step in analyzing text data in natural language processing. Several powerful networks have recently been developed for this purpose, including large language models (LLMs) such as BERT. Our study uses the BERT model because of its superior performance compared to other models. It is worth mentioning that ChatGPT has gained significant attention recently. Although both ChatGPT and BERT rely on deep learning techniques and massive amounts of unlabeled data, their architectures and training objectives differ. The main reason for choosing BERT is that it can be tailored to specific tasks, such as sentiment analysis, whereas ChatGPT is designed for conversational AI.
Figure 1- The overall schematic of the proposed method. (a) Learn a tweet feature extractor using large language models. The network is fine-tuned using the proposed manifold smoothness-based loss. (b) Generate samples for the minority class using the oversampling technique. We propose to modify the SMOTE algorithm by incorporating manifold smoothness. (c) Using the original data and the generated samples, learn the XGBoost algorithm.
The overall schematic of the subnetwork of this step is shown in Figure 1. As shown, we use the pre-trained BERT model and add an MLP on top of it. We then fine-tune the model on our data. Note that the BERT layers are frozen and only the MLP layers are tuned. To train the network, we define a new loss function that aims to ensure manifold smoothness. It is defined as follows:
| (1) |
| (2) |
| (3) |
| (4) |
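The loss is meant to produce a representation in which each sample's local neighbors share its class label. The sketch below is not the authors' loss; it is a simple diagnostic, under the assumption of Euclidean distance and a fixed neighborhood size k, that measures how well a given set of embeddings satisfies this property. The name `neighbourhood_purity` and the toy embeddings are hypothetical.

```python
import math


def neighbourhood_purity(embeddings, labels, k=2):
    """Fraction of samples whose k nearest neighbours all share their label."""
    pure = 0
    for i, x in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        others.sort(key=lambda j: math.dist(x, embeddings[j]))
        if all(labels[j] == labels[i] for j in others[:k]):
            pure += 1
    return pure / len(embeddings)


# Two well-separated clusters: every neighbourhood is label-consistent.
separated = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
separated_labels = ["a", "a", "a", "b", "b", "b"]
score_separated = neighbourhood_purity(separated, separated_labels, k=2)

# Alternating labels on a line: no neighbourhood is label-consistent.
mixed = [[0.0], [1.0], [2.0], [3.0]]
mixed_labels = ["a", "b", "a", "b"]
score_mixed = neighbourhood_purity(mixed, mixed_labels, k=1)
```

A score close to 1 on held-out embeddings would suggest that the fine-tuned MLP has moved samples of the same class into shared neighborhoods; the choice of k matters, as the conclusion of the paper also notes.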
Approach | AvgRec | $F_1^{PN}$ | Accuracy |
v1 | 69.3 | 61.8 | 70.1 |
v2 | 72.2 | 50.5 | 63.5 |
Our approach | 74.9 | 73.0 | 76.2 |
3.2 Comparison with SOTA
In this section, we compare the proposed method with baseline classifiers: SVM, Naïve Bayes, Random Forest, and XGBoost. In this experiment, we assume two labels {positive, negative} and set the labeled ratio to different values {0.01, 0.05, 0.1, 0.2, 0.4, 0.6}. A ratio of 0.01 means that only 1% of the positive samples in the training set are used to train the model; the other ratios are defined similarly. Note that in this experiment all positive samples form the minority class, and the neutral and negative samples together form the majority class. The results are shown in Table 2. As shown, the proposed method generally performs better than the other methods, which indicates that it is effective in handling imbalanced data.
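The experimental setup above can be sketched as follows. This is an illustrative reconstruction, not the authors' script: the helper names `binarise` and `subsample_minority`, the merged label name "negative", the random seed, and the toy class sizes are all assumptions.

```python
import random


def binarise(labels, minority="positive", majority="negative"):
    """Merge every non-minority label (e.g. neutral) into the majority label."""
    return [minority if label == minority else majority for label in labels]


def subsample_minority(samples, labels, minority, ratio, seed=0):
    """Keep all majority samples and a random `ratio` share of the minority ones."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(labels) if label == minority]
    # Assumption: keep at least one minority sample so the class never vanishes.
    n_keep = max(1, round(ratio * len(minority_idx)))
    kept = set(rng.sample(minority_idx, n_keep))
    keep = [i for i, label in enumerate(labels) if label != minority or i in kept]
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Toy training set (hypothetical sizes): 100 positive, 60 neutral, 40 negative.
tweets = [f"tweet {i}" for i in range(200)]
labels = binarise(["positive"] * 100 + ["neutral"] * 60 + ["negative"] * 40)

for ratio in (0.01, 0.05, 0.1, 0.2, 0.4, 0.6):
    _, y_sub = subsample_minority(tweets, labels, "positive", ratio)
    print(ratio, y_sub.count("positive"), y_sub.count("negative"))
```

Lower ratios leave fewer positive samples while the majority class stays untouched, so the imbalance becomes more severe as the ratio decreases.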
Table 2- The impact of the imbalanced ratio on SemEval-2017 Task 4. The results are compared with base approaches.
Approach | Ratio | AvgRec | $F_1^{PN}$ | Accuracy |
SVM | 1% | 0.494 | 0.497 | 0.988 |
SVM | 5% | 0.471 | 0.485 | 0.943 |
SVM | 10% | 0.446 | 0.471 | 0.892 |
SVM | 20% | 0.413 | 0.452 | 0.826 |
SVM | 40% | 0.816 | 0.498 | 0.832 |
SVM | 60% | 0.693 | 0.535 | 0.829 |
Naïve Bayes | 1% | 0.494 | 0.497 | 0.988 |
Naïve Bayes | 5% | 0.541 | 0.498 | 0.682 |
Naïve Bayes | 10% | 0.574 | 0.530 | 0.636 |
Naïve Bayes | 20% | 0.595 | 0.559 | 0.617 |
Naïve Bayes | 40% | 0.592 | 0.553 | 0.609 |
Naïve Bayes | 60% | 0.587 | 0.543 | 0.596 |
Random Forest | 1% | 0.494 | 0.497 | 0.988 |
Random Forest | 5% | 0.471 | 0.485 | 0.943 |
Random Forest | 10% | 0.446 | 0.471 | 0.892 |
Random Forest | 20% | 0.413 | 0.452 | 0.823 |
Random Forest | 40% | 0.413 | 0.452 | 0.826 |
Random Forest | 60% | 0.413 | 0.452 | 0.826 |
XGBoost | 1% | 0.494 | 0.494 | 0.975 |
XGBoost | 5% | 0.616 | 0.546 | 0.936 |
XGBoost | 10% | 0.561 | 0.523 | 0.868 |
XGBoost | 20% | 0.556 | 0.520 | 0.796 |
XGBoost | 40% | 0.601 | 0.572 | 0.797 |
XGBoost | 60% | 0.624 | 0.595 | 0.803 |
Our Approach | 1% | 0.559 | 0.503 | 0.935 |
Our Approach | 5% | 0.597 | 0.552 | 0.926 |
Our Approach | 10% | 0.649 | 0.567 | 0.897 |
Our Approach | 20% | 0.663 | 0.581 | 0.798 |
Our Approach | 40% | 0.710 | 0.599 | 0.765 |
Our Approach | 60% | 0.722 | 0.603 | 0.828 |
As shown in Table 2, the proposed method generally outperforms the other approaches. Among the base approaches, Random Forest has the worst results, and XGBoost performs better than SVM, Random Forest, and Naïve Bayes.
Also, to compare the proposed method with state-of-the-art approaches, we used the same training and test sets (i.e., the standard split) to learn the model. The results are given in Table 3. As shown, the proposed method performs better than the other compared approaches. It improves the AvgRec measure by 1.7 percentage points over the best competing approach (BERTweet, 73.2).
Table 3- The comparison of the proposed method with the recent successful approaches.
Approach | AvgRec | $F_1^{PN}$ | Accuracy |
XGBoost | 58.6 | 57.5 | 58.6 |
BB_twtr [15] | 68.1 | 68.5 | 65.8 |
DataStories [16] | 68.1 | 67.7 | 65.1 |
BERTweet [17] | 73.2 | 72.8 | 71.7 |
Our approach | 74.9 | 73.0 | 76.2 |
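For reference, the three measures reported in the tables can be computed as in the sketch below. It assumes the SemEval-2017 Task 4 definitions: AvgRec is the recall averaged over the three classes, and $F_1^{PN}$ is the average of the F1 scores of the positive and negative classes. The function names and the toy predictions are illustrative only.

```python
def _recall(y_true, y_pred, cls):
    support = sum(t == cls for t in y_true)
    hits = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    return hits / support if support else 0.0


def _precision(y_true, y_pred, cls):
    predicted = sum(p == cls for p in y_pred)
    hits = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    return hits / predicted if predicted else 0.0


def _f1(y_true, y_pred, cls):
    p, r = _precision(y_true, y_pred, cls), _recall(y_true, y_pred, cls)
    return 2 * p * r / (p + r) if p + r else 0.0


def avg_rec(y_true, y_pred, classes=("positive", "neutral", "negative")):
    """Recall averaged over the classes (macro-averaged recall)."""
    return sum(_recall(y_true, y_pred, c) for c in classes) / len(classes)


def f1_pn(y_true, y_pred):
    """Mean of the F1 scores of the positive and negative classes."""
    return (_f1(y_true, y_pred, "positive") + _f1(y_true, y_pred, "negative")) / 2


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy predictions (hypothetical), two tweets per class.
y_true = ["positive", "positive", "neutral", "neutral", "negative", "negative"]
y_pred = ["positive", "neutral", "neutral", "neutral", "negative", "positive"]
print(avg_rec(y_true, y_pred), f1_pn(y_true, y_pred), accuracy(y_true, y_pred))
```

Because AvgRec weights every class equally regardless of its size, it is less flattering to classifiers that ignore the minority class than plain accuracy, which is why a baseline can show high accuracy and low AvgRec on heavily imbalanced data.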
4. Conclusion
In this paper, we have proposed a new approach for imbalanced data in sentiment analysis. The proposed method introduces a framework in which the tweet representation is learned using a deep-learning-based BERT model. For this purpose, we propose a new loss function based on manifold smoothness, which aims to learn a discriminative representation of the samples. Then, we oversample the minority class using a new modification of the SMOTE algorithm, in which we only accept generated samples that satisfy manifold smoothness. Finally, the original data and the generated samples are fed into the XGBoost algorithm to learn a sentiment predictor model.
One of the main advantages of the proposed method, and a reason for its better performance, is the learned discriminative representation of the samples. This representation helps the model train a stronger and more generalizable task predictor.
One caveat of this approach is that the optimization might be sensitive to batch generation. As explained, the proposed loss function is based on manifold smoothness, and evaluating it properly requires suitable samples to be generated. Also, the size of the local neighborhood is important for satisfying manifold smoothness in steps one and two. In future work, we want to use a differentiable version of XGBoost to design an end-to-end framework.
Competing interests
The authors declare no competing financial interest.
Authors' contribution statement
SR: conceptualization, data curation, result analysis, methodology, writing, review & editing. ASH & MH: result analysis, project administration, review & editing. JT: conceptualization, supervision, project administration, review & editing.
References
[1] B. AlBadani, R. Shi, and J. Dong, "A novel machine learning approach for sentiment analysis on Twitter incorporating the universal language model fine-tuning and SVM," Applied System Innovation, vol. 5, no. 1, p. 13, 2022.
[2] I. K. Gupta, K. A. A. Rana, V. Gaur, K. Sagar, D. Sharma, and A. Alkhayyat, "Low-resource language information processing using dwarf mongoose optimization with deep learning based sentiment classification," ACM Transactions on Asian and Low-Resource Language Information Processing, 2023.
[3] P. Balage Filho, L. Avanço, T. Pardo, and M. d. G. V. Nunes, "NILC_USP: An improved hybrid system for sentiment analysis in twitter messages," in Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 2014, pp. 428-432.
[4] A. Tripathy, A. Anand, and V. Kadyan, "Sentiment classification of movie reviews using GA and NeuroGA," Multimedia Tools and Applications, vol. 82, no. 6, pp. 7991-8011, 2023.
[5] R. Gupta, "Data augmentation for low resource sentiment analysis using generative adversarial networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 7380-7384.
[6] I. Goodfellow et al., "Generative Adversarial Nets," in International Conference on Neural Information Processing Systems, 2014, pp. 2672–2680.
[7] L. S. Meetei, T. D. Singh, S. K. Borgohain, and S. Bandyopadhyay, "Low resource language specific preprocessing and features for sentiment analysis task," Language Resources and Evaluation, vol. 55, no. 4, pp. 947-969, 2021.
[8] K. Ghosh, A. Banerjee, S. Chatterjee, and S. Sen, "Imbalanced twitter sentiment analysis using minority oversampling," IEEE 10th international conference on awareness science and technology (iCAST), pp. 1-5, 2019.
[9] B. Krawczyk, B. T. McInnes, and A. Cano, "Sentiment classification from multi-class imbalanced twitter data using binarization," Hybrid Artificial Intelligent Systems: 12th International Conference, pp. 26-37, 2017.
[10] J. Ah-Pine and E. P. Soriano-Morales, "A study of synthetic oversampling for twitter imbalanced sentiment analysis," Workshop on interactions between data mining and natural language processing (DMNLP), 2016.
[11] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of artificial intelligence research, vol. 16, pp. 321-357, 2002.
[12] T. Chen et al., "Xgboost: extreme gradient boosting," R package version 0.4-2, vol. 1, no. 4, pp. 1-4, 2015.
[13] F. Sebastiani, "An axiomatically derived measure for the evaluation of classification algorithms," in International Conference on The Theory of Information Retrieval, 2015, pp. 11–20.
[14] P. Nakov et al., "Developing a successful SemEval task in sentiment analysis of Twitter and other social media texts," Language Resources and Evaluation, vol. 50, no. 1, pp. 35–65, 2016.
[15] M. Cliche, "BB_twtr at SemEval-2017 Task 4: Twitter sentiment analysis with CNNs and LSTMs," International Workshop on Semantic Evaluations, pp. 573–580, 2017.
[16] C. Baziotis, N. Pelekis, and C. Doulkeridis, "DataStories at SemEval-2017 Task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis," in International Workshop on Semantic Evaluations, 2017, pp. 747–754.
[17] D. Q. Nguyen, T. Vu, and A. T. Nguyen, "BERTweet: A pre-trained language model for English Tweets," arXiv preprint arXiv:2005.10200, 2020.