Fake News Detection Using Feature Extraction, Resampling Methods, and Deep Learning
Subject areas: Computer Engineering
Mirmorsal Madani 1, Homayun Motameni 2, Hosein Mohamadi 3
1 - Department of Computer Engineering, Sari Branch, Islamic Azad University, Sari, Iran
2 - Department of Computer Engineering, Sari Branch, Islamic Azad University, Sari, Iran
3 - Department of Computer Engineering, Azadshahr Branch, Islamic Azad University, Azadshahr, Iran
Keywords: Feature extraction, Deep learning, Resampling, Fake news, Imbalanced classification
Abstract:
Fake news was produced even before the advent of the internet. However, as the internet has developed and traditional media have given way to social media, the growing and seemingly unstoppable production and spread of this kind of news has become a widespread concern. By disrupting the proper flow of information and misleading public opinion, fake news can cause serious problems in society. Detecting such news is therefore necessary, but it comes with challenges, which may relate to datasets, events, or audiences. The main problems in some of these datasets are a lack of sufficient information about news samples and class imbalance; both are addressed in this paper. In the proposed model, key features are first extracted from the relevant datasets to increase the information available about news samples. Then, to balance these datasets, three novel hybrid resampling methods are presented. They use the K-nearest neighbors, genetic, and Tomek link algorithms as cleaning techniques, together with a purpose-built generative adversarial network as a technique for generating synthetic data. The presented methods significantly increase the performance of deep learning algorithms in detecting fake news.
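To make the cleaning step more concrete, the following is a minimal sketch of Tomek link removal, one of the cleaning techniques named above. It is not the authors' implementation: the function names (`nearest_neighbor`, `tomek_links`, `clean_majority`), the Euclidean distance, the choice to drop only the majority-class member of each link, and the toy data are all assumptions made for illustration.

```python
import math


def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i], excluding i itself."""
    best_j, best_d = -1, math.inf
    for j, p in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], p)  # Euclidean distance (assumed metric)
        if d < best_d:
            best_j, best_d = j, d
    return best_j


def tomek_links(points, labels):
    """Find pairs (i, j) that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    links = []
    for i, j in enumerate(nn):
        # i < j reports each pair once; nn[j] == i checks mutuality.
        if i < j and nn[j] == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


def clean_majority(points, labels, majority_label):
    """Drop the majority-class member of every Tomek link (assumed policy)."""
    drop = set()
    for i, j in tomek_links(points, labels):
        for k in (i, j):
            if labels[k] == majority_label:
                drop.add(k)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Hypothetical toy feature vectors: four "real" samples and three "fake" ones.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (1.0, 1.0),
          (1.05, 1.0), (3.0, 3.0), (3.2, 3.0)]
labels = ["real", "real", "real", "real", "fake", "fake", "fake"]

links = tomek_links(points, labels)
clean_points, clean_labels = clean_majority(points, labels, "real")
print(links)
print(clean_labels)
```

In this toy set, the "real" sample at (1.0, 1.0) and the "fake" sample at (1.05, 1.0) are each other's nearest neighbors, so they form a Tomek link, and removing the majority-class member clears the class boundary. In a hybrid scheme of the kind the abstract describes, a step like this would be combined with synthetic sample generation for the minority class.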