Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Subject Areas : Clustering and Classification
1 - Computer Technology Department, Faculty of Information Technology, Zawia University, Libya
Keywords: Corpora creation, TFIDF-Vector space model, news articles, Arabic text classification
Abstract :
Beyond its own merits, text classification (TC) has become a cornerstone of many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to automatically create modules that speed up classification for any text categorization task, and it also serves as a tool for creating Arabic text corpora. In particular, we build a text classification process for Arabic news articles downloaded from web news portals and sites. The suggested procedure is a pilot project that uses a set of documents pre-assigned by humans to subjects or categories. Vectorized Term Frequency-Inverse Document Frequency (TF-IDF) processing was used for the initial verification of the categories. The resulting validated categories were then used to predict categories for new documents. The experiment used 1000 initial documents pre-assigned to five categories, with 200 documents in each. An initial set of 2195 documents was downloaded from a number of Arabic news sources and pre-processed to test the utility of the suggested classification procedure, using cosine similarity as the classifier. Results were encouraging, with satisfactory precision, recall, and F1-score. The authors intend to improve the procedure and use it for Arabic corpora creation.
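The seed-document idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how seed documents are aggregated, so averaging each category's TF-IDF vectors into a centroid is an assumption, as are the whitespace tokenizer, the smoothed IDF formula, the toy seed texts, and all function names (`build_centroids`, `classify`, and so on). Only the Python standard library is used.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Whitespace tokenization only (assumption); a real Arabic pipeline would
    # also normalize letters, strip diacritics, and remove stop words.
    return text.split()


def fit_idf(token_lists):
    # Smoothed inverse document frequency computed over the seed documents.
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector as a dict; terms unseen in the seed set are dropped.
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    return {t: (c / total) * idf[t] for t, c in counts.items()} if total else {}


def cosine(u, v):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_centroids(seed_docs):
    # seed_docs: list of (category, text) pairs labelled by a human.
    token_lists = [tokenize(text) for _, text in seed_docs]
    idf = fit_idf(token_lists)
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter()
    for (category, _), tokens in zip(seed_docs, token_lists):
        sizes[category] += 1
        for term, weight in tfidf_vector(tokens, idf).items():
            sums[category][term] += weight
    # Average the seed vectors of each category (hypothetical aggregation).
    centroids = {c: {t: w / sizes[c] for t, w in vec.items()} for c, vec in sums.items()}
    return idf, centroids


def classify(text, idf, centroids):
    # Assign the category whose centroid is most similar to the new document.
    vec = tfidf_vector(tokenize(text), idf)
    scores = {c: cosine(vec, centroid) for c, centroid in centroids.items()}
    return max(scores, key=scores.get), scores


# Toy seed set, invented for illustration; the paper uses 200 seeds per category.
seed_docs = [
    ("sports", "فاز الفريق في مباراة كرة القدم"),
    ("sports", "سجل اللاعب هدفا في المباراة"),
    ("economy", "ارتفع سعر النفط في السوق"),
    ("economy", "انخفض سعر الدولار في السوق"),
]
idf, centroids = build_centroids(seed_docs)
label, scores = classify("خسر الفريق المباراة", idf, centroids)
print(label, scores)
```

With this toy data the new headline shares terms only with the sports seeds, so its similarity to the economy centroid is zero and it is assigned to sports. Comparing against every seed document and voting among the nearest ones would be an equally plausible reading of the abstract.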