A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier
Subject Areas: Data Mining
Saman Khalandi 1, Farhad Soleimanian Gharehchopogh 2 *
1 - Department of Computer Engineering, Urmia Branch, Islamic Azad University, Urmia, Iran.
2 - Department of Computer Engineering, Urmia Branch, Islamic Azad University, Urmia, Iran.
Keywords: Text Document Classification, Naive Bayes, Feature Selection, Invasive Weed Optimization
Abstract :
With the rapid growth in the number of documents, Text Document Classification (TDC) methods have become crucial. This paper presents a hybrid model of Invasive Weed Optimization (IWO) and the Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS), in order to reduce the large feature space in TDC. TDC includes several steps such as text processing, feature extraction, forming feature vectors, and final classification. In the presented model, the authors form a feature vector for each document by weighting the features for use in IWO. The NB classifier is then trained on the documents, and similar documents in the test set are classified together. FS increases accuracy and decreases calculation time. IWO-NB was evaluated on the Reuters-21578, WebKb, and Cade 12 datasets. To demonstrate the superiority of the proposed model in FS, Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) were used as comparison models. Results show that, in FS, the proposed model has a higher accuracy than NB and the other models. In addition, comparing the proposed model with and without FS suggests that the error rate decreases with FS.
[1] W. Hadi, Q.A. Al-Radaideh, S. Alhawari, Integrating associative rule-based classification with Naïve Bayes for text classification, Applied Soft Computing, Vol. 69, pp. 344-356, 2018.
[2] D. Mahata, R.R. Shah, J. Kuriakose, R. Zimmermann, J.R. Talburt, Theme-Weighted Ranking of Keywords from Text Documents Using Phrase Embeddings, IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), IEEE, pp. 184-189, 2018.
[3] A. Kulkarni, V. Tokekar, P. Kulkarni, Discovering Context of Labeled Text Documents Using Context Similarity Coefficient, Procedia Computer Science, Vol. 49, pp. 118-127, 2015
[4] K. Chen, Z. Zhang, J. Long, H. Zhang, Turning from TF-IDF to TF-IGM for term weighting in text classification, Expert Systems with Applications, Vol. 66, pp. 245-260, 2016.
[5] S. Ramanna, J.F. Peters, C. Sengoz, Application of Tolerance Rough Sets in Structured and Unstructured Text Categorization: A Survey, Thriving Rough Sets, Springer, Vol. 708, pp. 119-138, 2017.
[6] A.R. Mehrabian, C. Lucas, A novel numerical optimization algorithm inspired from weed colonization, Ecol. Inform. 1(4): 355-366, 2006.
[7] A. McCallum, K. Nigam, A Comparison of Event Models for Naive Bayes Text Classification, In AAAI-98 workshop on learning for text categorization, Vol. 752, pp. 41-48, 1998.
[8] X. Deng, Y. Li, J. Weng, J. Zhang, Feature selection for text classification: A review, Multimedia Tools and Applications, pp. 1-20, 2018.
[9] M. Rogati, Y. Yang, High-performing variable selection for text classification, in: CIKM ’02 Proceedings of the 11th International Conference on Information and Knowledge Management, pp. 659-661, 2002.
[10] Y. Yang, J.O. Pedersen, A comparative study on feature selection in text categorization, in: The Fourteenth International Conference on Machine Learning (ICML97), pp. 412-420, 1997.
[11] J. Holland, Adaptation in Natural and Artificial Systems, University of Michigan, Michigan, USA, 1975.
[12] J. Kennedy, R. C. Eberhart, Particle Swarm Optimization, In Proceedings of the IEEE International Conference on Neural Networks, pp. 1942-1948, 1995.
[13] A. Trstenjak, S. Mikac, D. Donko, KNN with TF-IDF based Framework for Text Categorization, Procedia Engineering, Vol. 69, pp. 1356-1364, 2014.
[14] Y. Ko, J. Seo, Text classification from unlabeled documents with bootstrapping and feature projection techniques, Information Processing & Management, Vol. 45, Issue 1, pp. 70-83, 2009
[15] D. Ghasempour, F.S.Gharehchopogh, A New Approach for Feature Selection in Text Documents Classification by Using Hybrid Model of Bat and K-Nearest Neighborhood Algorithms, Islamic Azad University, Urmia Branch, Thesis, Summer 2016.
[16] A. Allahvirdipour, F.S. Gharehchopogh, New Approach in Features Selection in Text Documents Classification using the Hybrid Model Algorithms of Naive Bayes and K-Means, Islamic Azad University, Urmia Branch, Thesis, Spring 2016.
[17] R. Habibpour, K. Khalilpour, A New Hybrid K-means and K-Nearest-Neighbor Algorithms for Text Document Clustering, International Journal of Academic Research, Vol. 6 Issue 3, pp. 79-84, 2014
[18] M. Karabulut, Fuzzy unordered rule induction algorithm in text categorization on top of geometric particle swarm optimization term selection, Knowledge-Based Systems, Vol. 54, pp. 288-297, 2013.
[19] A.K. Uysal, S. Gunal, Text classification using genetic algorithm oriented latent semantic features, Expert Systems with Applications, Vol. 41, Issue 13, pp. 5938-5947, 2014
[20] T. Wei, Y. Lu, H. Chang, Q. Zhou, X. Bao, A semantic approach for text clustering using WordNet and lexical chains, Expert Systems with Applications, Vol. 42, Issue 4, pp. 2264-2275, 2015
[21] W. Zhang, X. Tang, T. Yoshida, TESC: An approach to TExt classification using Semi-Supervised Clustering, Knowledge-Based Systems, Vol. 75, pp.152-160, 2015
[22] K.K. Bharti, P.K. Singh, Opposition chaotic fitness mutation based adaptive inertia weight BPSO for feature selection in text clustering, Applied Soft Computing, Vol. 43, pp. 20-34, 2016.
[23] D. AbuZeina, F.S. Al-Anzi, Employing fisher discriminant analysis for Arabic text classification, Computers & Electrical Engineering, in press, corrected proof, Available online 10 November 2017.
[24] R. Wongso, F.A. Luwinda, B.C. Trisnajaya, O. Rusli, Rudy, News Article Text Classification in Indonesian Language, Procedia Computer Science, Vol. 116, pp. 137-143, 2017.
[25] H.P. Luhn, A Statistical Approach to the Mechanized Encoding and Searching of Literary Information, IBM Journal of Research and Development, Vol. 1, No. 4, pp. 309-317, 1957.
[26] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, 1989.
[27] R.S. Michalski, I. Bratko, M. Kubat, Machine Learning and Data Mining: Methods and Applications, New York: Wiley, 1998.
[28] D. Francois, Binary classification performances measure cheat sheet, 2009.
[29] C. Blake, C.J. Merz, UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], University of California, Department of Information and Computer Science, Irvine, CA, 1998, pp. 55.
[30] http://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
[31] http://ana.cachopo.org/datasets-for-single-label-text-categorization
[32] A. Onan, S. Korukoglu, H. Bulut, Ensemble of keyword extraction methods and classifiers in text classification, Expert Systems with Applications, Vol. 57, pp. 232-247, 2016.
[33] A.K. Uysal, An improved global feature selection scheme for text classification, Expert Systems with Applications, Vol. 43, pp. 82-92, 2016.
[34] H. Uguz, A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm, Knowledge-Based Systems, Vol. 24, Issue 7, pp. 1024-1032, 2011.
[35] W. Zong, F. Wu, L.K. Chu, D. Sculli, A Discriminative and Semantic Feature Selection Method for Text Categorization, International Journal of Production Economics, Vol. 165, pp. 215-222, 2015.
[36] C. Veenhuis, Binary Invasive Weed Optimization, Second World Congress on Nature and Biologically Inspired Computing (NaBIC), pp. 449-454, 2010.
[37] L.M. Abualigah, A.T. Khader, Unsupervised text feature selection technique based on hybrid particle swarm optimization algorithm with genetic operators for the text clustering, The Journal of Supercomputing, Vol. 73, Issue 11, pp. 4773-4795, 2017.
Journal of Advances in Computer Engineering and Technology
I. INTRODUCTION
TDC is a major part of the content analysis of texts and is used in many applications such as text filtering, automatic response systems, and applications related to the automatic organization of documents [1]. Nowadays, a huge mass of information and knowledge is in digital text format. Considering the growth rate of knowledge, classifying documents in order to reduce information complexity and provide easy and quick access to information is a very important issue. The purpose of document classification is to access data quickly. The absence of a classification system increases cost and the time spent on text operations, because traditional document classification methods require a long time to find documents.
Many documents are stored in electronic text formats. TDC is a required model for extracting knowledge from this large mass of text information. As a significant technique in information retrieval and natural language processing, information classification is a challenging but effective solution for organizing text databases [2]. Considering the growth of electronic texts and documents, an efficient method for data retrieval is mandatory. For retrieving data, the best approaches are understanding the main concept of the text, text classification, finding the proper search words, and keyword extraction. Keywords are a set of important words in a document that describe its content, and they are useful for different purposes. By finding the keywords, we can grasp the contents of text documents [3]. Overall, keywords are a useful tool for searching a large mass of documents in a short time. The two major methods for keyword extraction are [4]:
· Term Frequency-Inverse Document Frequency (TF-IDF) methods: in these methods, the frequency of a word in a document is weighed against its frequency in the whole document set.
· Machine Learning (ML) methods: in these methods, the keyword extraction process is modeled as a classification problem by means of a set of training documents and their known keywords. These methods are highly flexible.
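The TF-IDF idea in the first bullet can be illustrated with a short sketch. The Python code below is a minimal example, not the weighting used later in this paper: it assumes the common form weight = tf × log(N / df), and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: raw term frequency times log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus (hypothetical); each document is already tokenized.
docs = [
    ["project", "program", "project"],
    ["project", "structure"],
    ["computer", "software", "project"],
]
w = tf_idf(docs)
```

In this toy corpus the word "project" occurs in every document, so its weight is zero everywhere, while a word that occurs in only one document keeps a positive weight. This is the contrast between in-document and collection-wide frequency described above.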
Text document analysis through ML techniques, intelligent information retrieval, natural language processing, etc. is a subcategory of data mining. These techniques were first applied to structured data, that is, records that share the same structure but are gathered in a file completely independently of one another. Text documents, however, are mainly either unstructured or semi-structured, so we must first make them structured and then use these methods to extract information and knowledge from them [5].
TDC means assigning text documents, according to their content, to one or more predefined classes. In classification, there is a training set of documents with known classes; by means of this set the classifier is built, and the class of a new document is determined. To measure the effectiveness of a TDC model, a test set is defined independently of the training set, and the estimated labels are compared with the real labels. Accuracy is calculated as the ratio of correctly classified documents to the total number of documents.
In this paper, we propose a hybrid model of IWO [6] and the NB classifier [7] for TDC. In the IWO-NB model, we use IWO for FS and the NB classifier for classification of similar groups. Although a typical text collection contains a very large number of words, most of them carry little or no information for TDC. Thus FS, or dimension reduction, becomes necessary because it not only reduces the measurement and storage requirements but also improves prediction performance. IWO is a new and powerful optimization algorithm that imitates the adaptability and randomness of weed colonies. By definition, a weed is a plant that grows and reproduces in unintended places according to the environment; it acts as a pest for useful agricultural plants and hinders their growth. Even though it is very simple, IWO is very quick and effective in finding optimum locations, and it mimics the reproduction, growth, and struggle for survival of natural weeds in a colony. The NB algorithm is a data mining technique for classification. NB has characteristics such as simplicity, high computational efficiency, and good classification accuracy, especially for high-dimensional data such as texts. In this technique, each class is treated as a hypothesis with a probability. Any new training data increases or decreases the probability of the prior hypotheses, and eventually the hypothesis with the highest probability is chosen as the class and assigned as the label.
Commonly used feature selection methods are filter methods such as chi-square (CHI) and information gain (IG). Some comparative studies are given in [8, 9, 10]. These methods simply calculate a score for each feature and then remove the features with small scores. In this paper, three metaheuristic algorithms, namely IWO, Genetic Algorithm (GA) [11], and PSO [12], were used to extract the features, because of the need for feature selection and high precision. IWO was chosen for FS over the GA and PSO models because its detection precision is high and it is more precise in choosing features.
The remainder of this paper is organized as follows. In Section 2, we review related work on TDC. In Section 3, the proposed model is described. In Section 4, experimental results are introduced and the GA-NB, PSO-NB, and IWO-NB models for FS and classification are presented. In Section 5, the results of the proposed model are assessed and the model is compared with other models. Finally, in Section 6, conclusions are drawn and suggestions are made for future studies.
II. RELATED WORK
Considering the large volume and wide domain of text documents available from online and other sources, the retrieval and processing of text documents will face many problems unless they are properly classified. The most significant step in the classification of text documents is choosing the proper feature space; the accuracy of a model depends highly on the chosen keywords that define the domain of the document.
The K-Nearest Neighbor (KNN) model with TF-IDF has been recommended for the classification of text documents [13]. In experiments performed on the WebKb dataset, the highest classification accuracy for KNN is 0.92. A hybrid model of Support Vector Machine (SVM), NB, and KNN has been recommended for TDC under the title TCFP [14]. Assessment is carried out on the Reuters-21578 [30], WebKb [31], and Cade 12 [31] datasets. The F-Measure for the three datasets is 86.19, 75.47, and 89.09, respectively. In comparison with SVM, NB, and KNN, it has a higher accuracy.
A hybrid model of KNN and the Bat Algorithm (BA) has been recommended for TDC [15]. In this model, BA is used for FS and the KNN algorithm for text similarity. Text documents are first preprocessed and the keywords in each document are extracted. Then, a specific weight is set for each keyword based on its repetition. Assessment is carried out on the Reuters-21578, WebKb, and Cade 12 datasets. Comparisons suggest that the proposed model is more accurate than the K-Means, K-Means-KNN, and NB-K-Means models.
The hybrid model KNN-K-Means [17] has been recommended for the clustering of text documents. In this model, the KNN algorithm is used for identifying similar clusters, and the K-Means algorithm is used for accuracy in document clustering. Results on Reuters-21578 show that the proposed model is more accurate than the K-Means model.
The hybrid model NB-K-Means has been tested on the Reuters-21578, WebKb, and Cade 12 datasets for TDC [16]. Results indicate that the hybrid model NB-K-Means is more accurate than the K-Means model. Moreover, the highest accuracy of the proposed model, 93.30%, is obtained with K=3. Models have also been recommended for reducing the size of data using the PSO algorithm [18]. The PSO algorithm has been used along with a hybrid of fuzzy, NB, and SVM models. Results were assessed on the Reuters and OHSUMED datasets. Assessments indicate that the accuracy of the fuzzy model is higher than that of the other models.
The GALSF model [19] has been proposed based on GA and effective FS. In this model, in addition to FS, the relations between features are considered, and these relations are used for finding similar classes. Each feature gets a score according to its repetition, and the features with the highest scores are influential in classification and in the number of classes. Results on the Reuters-21578 dataset show that the GALSF model is more accurate than other models.
A model based on the semantic web and WordNet has been proposed for text document clustering [20]. In the semantic-web-based model, the closeness and synonymy of features are used to improve clustering accuracy. Based on the semantic model of the words, each cluster chooses a feature as the cluster head, and if some features are vague, WordNet is used for finding semantic similarity. Results are obtained on the Reuters-21578 dataset, and the percentage distribution of features and clusters is reported according to the F-Measure.
The TESC model [21], using SVM and a back-propagation Artificial Neural Network (ANN), has been recommended for the classification of text documents of the Reuters-21578 dataset. SVM is a data classification method based on a separating hyperplane: data are grouped on the two sides of the hyperplane. The ANN assesses document identification accuracy based on data training and testing. Results suggest that the accuracy of the back-propagation ANN is lower than that of SVM.
Bharti et al. suggested a hybrid chaotic BPSO model for text document clustering [22]. A chaos factor was used for selecting the optimum features of the BPSO model. First, features are selected using BPSO at the feature indexing stage; then, using chaos, the closeness of features is evaluated and similar features are placed in one vector. Results on Reuters-21578, Classic4, and WebKb show that the BPSO model is more accurate in identification than the SGA, CBPSO, and AIWPSO models.
AbuZeina and Al-Anzi [23] examined the capacity of Linear Discriminant Analysis (LDA) for Arabic text classification. LDA, also known as Fisher’s LDA, is one of the popular dimensionality reduction techniques that show good performance in pattern recognition tasks. In other words, the study is an attempt to understand whether LDA is adequate for text classification, as it is in other celebrated successful implementations such as face recognition. The prior art shows that LDA is rarely used for Arabic text classification despite its good capabilities in dimensionality reduction. Therefore, the work focuses on implementing the LDA method for Arabic text classification, since such applications generally contain sizable vocabularies that lead to large feature vectors. The experimental results showed that the performance of the semantic-loss LDA feature vectors is almost the same as that of the semantic-rich latent semantic indexing (LSI) method. In contrast to LDA, LSI employs a singular value decomposition (SVD) method to generate semantic-rich features. Semantic-rich means that the method preserves and captures the inherent latent relationships between the words in the different documents. Besides LDA, SVD is another favorable dimensionality reduction technique. Benchmark comparisons showed that LDA is a worthy method, as it gives promising results when compared with SVM, KNN, NN, NB, the cosine measure, etc. For instance, SVD-SVM scored an accuracy of up to 84.75% while LDA scored 84.4%. These results point to the importance of employing LDA for text classification.
In [24], news articles published on www.cnnindonesia.com were crawled, for a total of 5,000 documents. The documents consist of 1,000 documents for each of the classes Health, Sports, Economy, Politic, and Technology. The documents are randomly partitioned with a ratio of 80:20 for training and testing. Feature selection in this research is done using TF-IDF and Singular Value Decomposition (SVD), and the classifiers compared are Multinomial Naïve Bayes (MNB), Bernoulli Multivariate Naïve Bayes (BNB), and SVM. Based on the test results, the hybrid of TF-IDF and the MNB classifier gave the highest result compared to the other algorithms, with a precision of 0.9841 and a recall of 0.9840 (around 98.4%), followed by the combination of TF-IDF and BNB at around 98.2%. In terms of the time consumed to process the data, MNB and BNB both gave the best result despite the very large amount of data extracted by TF-IDF. Table (1) compares the models proposed by researchers for TDC.
TABLE 1 COMPARISON OF THE PROPOSED MODELS FOR TDC
The KNN and SVM models are high-performance classifiers, but KNN does not give unique results: each run can produce a response that differs from the previous one. The KNN model uses all training samples when making a decision, which involves some disadvantages, including slow classification and high memory requirements. The SVM model, despite having unique results, has a high computational time. When new unclassified samples arrive, the SVM model uses all of the previous training samples to update the classifier, which has a high cost. Premature convergence is one of the main problems of the PSO algorithm. The particles gradually circle in the search space near the best point found and do not explore the rest of the space; in other words, the particles converge. Because the particle velocity decreases as the iterations increase, the algorithm converges to the best solution discovered so far, which is not guaranteed to be the global best. This is the result of an inappropriate balance between local and global search. In the PSO algorithm, global search is preferred in the first iterations, which helps to improve performance; in the final iterations, global search is reduced and local search is preferred in order to make the most of the information obtained.
III. PROPOSED MODEL
When features are not independent, the accuracy of the NB classifier decreases. In the IWO-NB model, the IWO algorithm is used to enhance the accuracy of the NB classifier. Features with the lowest discriminative effect are omitted, subject to the total number of omitted features, and the remaining features are given to NB. A text can have too many features and/or correlated features, which cause both inefficiency and inaccuracy during TDC. As a result, ranking the features by their distinctiveness and selecting only the distinctive ones to perform TDC can help achieve better TDC performance. Carrying out the FS process with the IWO algorithm enhances the NB classifier in domains with codependent features. In addition, because less important features are omitted, the proposed model increases the calculation speed and yields the optimum answer in a shorter time. Figure (1) presents the flowchart of the proposed model.
Fig. 1. Flowchart of the Proposed Model
Table (2) displays the pseudo code of the proposed model.
TABLE 2 THE PSEUDO CODE OF THE PROPOSED MODEL
In the dataset reading stage, the Reuters-21578, WebKb, and Cade 12 datasets are read and then enter the preprocessing stage. In preprocessing, irrelevant words and verbs are omitted. In this step, the functional words that are used to construct natural language documents but are not related to any specific topic, such as “a”, “an”, “the”, “in”, “of”, “to”, etc., are removed. In the context of TDC, functional words are common words that are not related to the concept of the text. The stop-words consist of the pronouns, conjunctions, articles, and prepositions that should be removed for the sake of dimension reduction. In the keyword extraction stage, keywords are counted using Equation (1). The two basic parameters in term weighting strategies are the raw term frequency TF (the number of occurrences of a term in document D) and the inverse document frequency IDF (term occurrence across a collection). In this paper, we used TF to obtain the weights of the terms and then converted the results to the vector space model, Di={wi1, wi2,…,wit}. Here, i, w, and t denote the index of the document, the weight of a term in the document, and the total number of terms, respectively. Equation (1) is one of the TF methods, in which (tk, di) is the repetition frequency of each feature tk in the document di [25].
| (1) |
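The preprocessing and TF weighting steps described above can be sketched as follows. This is a minimal illustration under assumptions, not the pipeline used in the experiments: the stop-word list is a tiny hypothetical sample, tokenization is a simple lowercase letter match, and the weight of a term is taken to be its raw count in the document.

```python
import re
from collections import Counter

# Illustrative stop-word list only (assumption); a real list is much longer.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "and", "is"}

def preprocess(text):
    """Lowercase the text, split it into alphabetic tokens, drop stop-words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tf_vector(tokens, vocabulary):
    """Raw term-frequency vector D_i = (w_i1, ..., w_it) over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

doc = "The structure of the project is in the project plan."
tokens = preprocess(doc)
vocabulary = sorted(set(tokens))
vector = tf_vector(tokens, vocabulary)
```

Here `vocabulary` fixes the order of the t terms, so every document maps to a vector of the same length, which is what the later feature selection stage operates on.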
Different weighting schemes, such as the Term Frequency (TF) model [25] and the TF-IDF model [25], can be used to assign a weighting value to each term feature and, accordingly, determine the document vector. The weighting is often associated with the frequency of each term. In the IWO algorithm stage, the initial population and the vectors are formed. In this stage, vectors are formed based on the words’ weights. The IWO algorithm starts the search, scrutinizes the distribution of weights, and defines one vector for weights that are similar to one another. The operation continues until all weights are placed in the vectors. Then the vectors are assessed and their fitness is calculated according to Equation (2).
| (2) |
In IWO, each weed in the population represents one candidate solution for the problem. Each weed has a position of dim dimensions, denoted as a vector whose values are either 0 or 1, as shown in Figure (2). Each dimension is treated as one feature. From Figure (2), we can say that weed X has dim (here dim=200) features, each with a value of either 0 or 1. If the value at position j is 1, the jth feature is selected; otherwise, it is not selected. To generate the values 0 and 1, we change IWO to Binary IWO [36]. Binary IWO determines its binary seeds in a normally distributed neighborhood in the space of bit-strings (0 or 1). The normal distribution is realized over the number of different bits.
Fig. 2. Solution representation of a weed
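The binary representation and the seed generation described above can be sketched in a few lines. The Python code below is an illustrative reading of "a normally distributed neighborhood over the number of different bits", not the exact procedure of [36]: the number of flipped bits is drawn as the rounded absolute value of a normal sample, and the minimum of one flip, the value of `sigma`, and all function names are assumptions.

```python
import random

def random_weed(dim, rng):
    """A weed is a bit-string: bit j == 1 means feature j is selected."""
    return [rng.randint(0, 1) for _ in range(dim)]

def spawn_seed(parent, sigma, rng):
    """Copy the parent and flip a normally distributed number of bits."""
    dim = len(parent)
    # Assumption: at least one bit and at most dim bits are flipped.
    n_flips = min(dim, max(1, round(abs(rng.gauss(0.0, sigma)))))
    seed = list(parent)
    for j in rng.sample(range(dim), n_flips):
        seed[j] = 1 - seed[j]
    return seed

def selected_features(weed):
    """Indices of the features switched on in a weed."""
    return [j for j, bit in enumerate(weed) if bit == 1]

rng = random.Random(0)
parent = random_weed(200, rng)           # dim = 200, as in the text
child = spawn_seed(parent, sigma=5.0, rng=rng)
hamming = sum(p != c for p, c in zip(parent, child))
```

With a small `sigma` most seeds stay close to their parent in Hamming distance, which gives a local search; a larger `sigma` spreads the seeds further across the space of feature subsets.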
The objective function for the proposed model is the mean absolute difference (MAD) [37].
| (3) |
| (4) |
Where, is the number of selected features in text document, is the mean value of the vector is the weighting value of feature k in document i and j is the number of features in the original text dataset.
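For orientation, one common way to write the MAD objective in the feature selection literature, including [37], is given below. This is a hedged sketch rather than a verbatim copy of Equations (3) and (4): the symbols $a_i$ (number of selected features in document $i$), $\bar{x}_i$ (mean weight of the vector), $x_{i,j}$ (weight of feature $j$ in document $i$), and $t$ (number of features in the original dataset) are notation assumed here.

```latex
\mathrm{MAD}(X_i) = \frac{1}{a_i} \sum_{j=1}^{t} \left| x_{i,j} - \bar{x}_i \right|,
\qquad
\bar{x}_i = \frac{1}{a_i} \sum_{j=1}^{t} x_{i,j}
```

Under this form, a feature subset scores higher when the weights of its selected features deviate more from their mean, that is, when the subset keeps the more discriminative terms.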
After that, in the sub-feature selection stage, vectors are chosen and enter the training and testing stages. To classify text documents, we first divide them into two sets, namely training and testing. We build the model with the training set and evaluate it with the testing set, so that the resulting model has a high accuracy. In fact, the testing set is used to determine the accuracy of the model built from the training set. In addition, classification of the training documents is done with the NB classifier.
The fitness function is assessed to verify accuracy. If the accuracy of the classification is deemed acceptable, the classification is shown as output; otherwise, the search space is updated to reach a better answer. To update the search space, changes need to be made to the solution vectors. For these changes, we use the cosine distance according to Equation (5) [26].
| (5) |
In Equation (5), w and v are the word weight and the vector, respectively. Each vector is assessed with its word weights. If the value of the first vector is bigger than that of the second vector, a random number of the vectors’ indices are switched according to Equation (6).
| (6) |
In Equation (6), the parameter wmax is the highest weight value in the vector k, vi is the ith index of the vector k, and fmin is the fitness function of the kth vector.
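The cosine measure mentioned for Equation (5) can be illustrated with the standard vector-space definition. The sketch below assumes the usual form, the dot product divided by the product of the vector norms, with distance taken as one minus similarity; it does not reproduce the index-switching rule of Equation (6), and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Standard cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention assumed here for all-zero vectors.
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Distance in [0, 1] for non-negative weight vectors."""
    return 1.0 - cosine_similarity(u, v)
```

Because term weights are non-negative, two documents with proportional weight vectors have distance 0 and two documents with no term in common have distance 1.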
3.1. Naive Bayes Classifier
In the NB classifier, the classification input includes the parameter d, i.e. the text documents; C={c1, c2, …, cj}, i.e. the classes; and the training data (d1, c1), …, (dm, cm). The NB classifier is defined for documents and classes according to Equation (7). In Equation (7), the parameters w and c are the number of words and documents, respectively.
| (7) |
NB classification starts with the initial step of analyzing the text document by extracting the words it contains to generate a list of words. The list of words is constructed with the assumption that the input document contains the words w1, w2, w3, …, wn-1, wn, where the length of the document (in number of words) is n. Table (3) illustrates the NB classifier in TDC. In Table (3), there are 4 training documents and 1 test document. We determine the closeness of the words to each class by means of NB and allocate document 5 to class c.
TABLE 3
WORD CLASSIFICATION WITH NB CLASSIFIER
Class | Words | Documents | Dataset |
c | Program Project Project | 1 | Training |
c | Project Pipeline Program Project | 2 | Training |
c | Project Structure | 3 | Training |
j | Computer Software Project | 4 | Training |
? | Project Project Project Computer Software | 5 | Test |
In Table (3), the prior probabilities of classes c and j are P(c)=3/4 and P(j)=1/4. In Table (4), the conditional probabilities of the words given classes c and j are evaluated. In the NB classification method, all features are assumed to be independent and to have different weights.
TABLE 4
EXAMINING CLASSES’ PROBABILITY FOR TDC
Evaluating Probabilities | Class |
P(Project |c)=(5+1)/(8+6)=6/14=3/7 | Class c |
P(Computer |c)=(0+1)/(8+6)=1/14 | Class c |
P(Software |c)=(0+1)/(8+6)=1/14 | Class c |
P(Project |j)=(1+1)/(3+6)=2/9 | Class j |
P(Computer |j)=(1+1)/(3+6)=2/9 | Class j |
P(Software |j)=(1+1)/(3+6)=2/9 | Class j |
Table (4) is used to determine whether document number 5 is closer to class c or class j. In Table (5), the probability of class c is higher. Therefore, document number 5 belongs to class c. The probability of c is higher because the word Project is repeated 3 times in document number 5.
TABLE 5
CLASS SELECTION FOR AN UNIDENTIFIED DOCUMENT IN NB CLASSIFIER
Class selection |
P(c|d5) = 3/4 * (3/7)^3 * 1/14 * 1/14 ≈ 0.0003 |
P(j|d5) = 1/4 * (2/9)^3 * 2/9 * 2/9 ≈ 0.0001 |
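The arithmetic of Tables (4) and (5) can be checked with a short script. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the counts (5 occurrences of Project among 8 tokens of class c, a vocabulary of 6 words) are chosen to reproduce the probabilities stated in Table (4).

```python
from fractions import Fraction

VOCAB_SIZE = 6  # vocabulary size used in the denominators of Table (4)

# Per-class statistics chosen to match Table (4) (assumed inputs).
stats = {
    "c": {"prior": Fraction(3, 4), "tokens": 8, "counts": {"Project": 5}},
    "j": {"prior": Fraction(1, 4), "tokens": 3,
          "counts": {"Project": 1, "Computer": 1, "Software": 1}},
}
test_doc = ["Project", "Project", "Project", "Computer", "Software"]


def smoothed_likelihood(word_count, class_tokens, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(word | class)."""
    return Fraction(word_count + 1, class_tokens + vocab_size)


def class_score(label):
    """Prior times the product of word likelihoods over the test document."""
    s = stats[label]
    score = s["prior"]
    for word in test_doc:
        score *= smoothed_likelihood(s["counts"].get(word, 0),
                                     s["tokens"], VOCAB_SIZE)
    return score


scores = {label: class_score(label) for label in stats}
predicted = max(scores, key=scores.get)
print({k: round(float(v), 4) for k, v in scores.items()}, predicted)
```

Exact fractions are used so that the intermediate values can be compared directly with the table entries before rounding.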
3.2. Assessment Factors
The results of the proposed model must be analyzed at the assessment stage to reveal their value and, consequently, the effectiveness of the model. These factors can be calculated both for the training records at the training stage and for the test records at the assessment stage. Several assessment factors exist, such as precision, recall, F-Measure, and accuracy. To assess the IWO-NB model, we use accuracy [27] [28]. Precision (P), Recall (R), and F-Measure are widely used metrics in the text mining literature for text categorization. Precision is the ratio of the number of correct positive predictions to the total number of positive predictions, and Recall is the ratio of the number of correct positive predictions to the total number of positive documents. F-Measure is the harmonic mean of P and R.
| (8) |
| (9) |
| (10) |
| (11) |
| (12) |
| (13) |
The parameter TP represents the records with a real positive class that were correctly identified as positive by the algorithm. TN represents the records with a real negative class that were correctly identified as negative by the algorithm. FP represents the records with a real negative class that were mistakenly identified as positive by the algorithm. FN represents the records with a real positive class that were mistakenly identified as negative by the algorithm.
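As an illustration of how these factors follow from the four counts, the sketch below uses the standard textbook definitions of precision, recall, F-Measure, accuracy, and error rate. The function name and the example counts are hypothetical and are not taken from the experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics from the confusion counts TP, TN, FP, FN."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-Measure is the harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```

The error rate is taken as the complement of accuracy, which is consistent with the accuracy and error-rate values reported for the proposed model.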
IV. Experimental Results
First, the performance of the models in this paper was tested on 13 different datasets. These datasets are taken from the UCI machine learning repository [29], and their description is given in Table (6). Some of these datasets contain missing values. Each missing value is replaced with the average of the values taken by the corresponding feature; in addition, the dataset features are normalized. The sizes of the train and test sets are shown in Table (6): 75% of the data is used as the train set in the training process, and the remaining 25% is used as the test set in the testing process. Three criteria are reported to evaluate each approach: classification accuracy, error rate, and computational time.
TABLE 6 DATASETS DESCRIPTION
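The preprocessing described above (mean imputation, normalization, and a 75%/25% split) might be sketched as follows. The normalization method and the way records are assigned to the two sets are not specified in the text, so min-max scaling and a seeded random shuffle are assumptions here, and all function names are hypothetical.

```python
import random


def impute_mean(rows):
    """Replace missing values (None) with the mean of their feature column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        present = [r[j] for r in rows if r[j] is not None]
        means.append(sum(present) / len(present))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def min_max_normalize(rows):
    """Scale every feature to [0, 1] (assumed normalization scheme)."""
    n_cols = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_cols)]
    highs = [max(r[j] for r in rows) for j in range(n_cols)]
    return [[(r[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
             for j in range(n_cols)] for r in rows]


def split_train_test(rows, train_fraction=0.75, seed=0):
    """Shuffle with a fixed seed, then cut into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy data with one missing value, for illustration only.
data = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [4.0, 20.0]]
clean = impute_mean(data)
scaled = min_max_normalize(clean)
train_rows, test_rows = split_train_test(scaled)
print(clean, len(train_rows), len(test_rows))
```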
The performance of the models is assessed through several analyses on datasets publicly available in the UCI data repository. Classification performance is measured by classification accuracy and compared with the results of GA-NB and PSO-NB. As Table (7) shows, IWO-NB achieves the best accuracy: its classification accuracy is better than that of GA-NB and PSO-NB on all datasets, and the accuracy of PSO-NB is better than that of GA-NB. FS is one of the key factors in enhancing the ability of a classifier in a classification problem. In this paper, three metaheuristic variants based on the NB classifier were proposed.
TABLE 7 RESULTS OF MODELS FOR 13 DIFFERENT UCI DATASETS BASED ON (ACCURACY/FS)
We have also compared the error rates of the models using the training and testing samples of each dataset. According to Table (8), GA-NB has the worst error rate. In terms of error rate, IWO-NB is successful on all datasets except the “Horse” dataset.
TABLE 8 ERROR RATE ON TRAINING AND TESTING SETS WITH EACH MODEL
The error rates in Figure (3) show that IWO-NB obtained a lower error value than GA-NB and PSO-NB.
Fig. 3. The comparison of the percentage of the error rate for each model
All models were executed on the same machine with the following configuration: Intel(R) Core(TM) i7-4510U CPU, 6 GB RAM, and the Windows 8.1 operating system. All models use the same parameter settings and are tested on the same datasets, so computational time can be used to compare their performance. Table (9) presents the computational time (in seconds) required by each model to reach a near-optimal solution.
TABLE 9 COMPARISON OF MODELS BASED ON EXECUTION TIME
The experimental results in Table (9) show that the computational time of IWO-NB is shorter than that of GA-NB and PSO-NB. IWO-NB can complete FS in a very short time when dealing with relatively large-scale datasets. Because GA, PSO, and IWO are all metaheuristic techniques, their results can differ between runs. Figure (4) shows the results in terms of computational time. Regarding the running times of the models, the best performance is obtained by IWO-NB and the worst by GA-NB. IWO has a strong exploration ability; it is a gradual search process that approaches optimal solutions. The execution time of IWO is affected mainly by the problem dimension (number of features) and the size of the data. For some datasets with more features, the GA cannot find a better solution after finding a sub-optimal one, whereas IWO can keep searching the feature space until the optimal solution is found. The GA is greatly affected by the number of features.
Fig. 4. Chart of Comparison of Models based on Execution Time
V. Results and Assessment
In this section, the assessment is carried out and the results are presented on the Reuters-21578, WebKb, and Cade 12 datasets; the models are implemented in the VC#.NET 2017 programming language. The initial population size and the number of iterations of the IWO algorithm are 50 and 100, respectively. To show the efficiency of the proposed model, the datasets were first classified with the NB classifier alone. All experiments are conducted on these three benchmark datasets, which are pre-classified into several categories. The Reuters-21578 dataset is a standard and widely distributed collection of news published on the Reuters newswire in 1987. It consists of 21,578 documents, which are distributed non-uniformly over 135 thematic categories. The WebKb dataset was prepared by Craven in 1998. It contains 8,282 web pages gathered from four academic domains. The original dataset has seven categories, but only four of them (course, faculty, project, and student) are used. The Cade 12 dataset consists of 40,983 documents, which correspond to a subset of web pages extracted from the CADE Web Directory, which points to Brazilian web pages classified by human experts.
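To show where the population size (50) and the iteration count (100) enter the search, the following is a generic continuous IWO loop in the spirit of [6]. It is a sketch under stated assumptions, not the authors' implementation: the seed limits, the dispersion schedule, the search bounds, and the sphere cost function are illustrative choices, and the feature encoding used for FS is not reproduced here.

```python
import random


def sphere(weed):
    """Toy cost function (assumption): sum of squares, minimum at the origin."""
    return sum(x * x for x in weed)


def iwo_minimize(cost, dim, pop_size=50, iterations=100, max_pop=50,
                 seeds_min=0, seeds_max=5, sigma_init=1.0, sigma_final=0.01,
                 modulation=3, bounds=(-5.0, 5.0), rng=None):
    """Minimal invasive weed optimization loop for a continuous cost."""
    rng = rng or random.Random(0)
    lo, hi = bounds
    # Initial colony: random positions inside the search box.
    colony = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    for it in range(iterations):
        # Seed dispersion shrinks nonlinearly from sigma_init towards sigma_final.
        frac = (iterations - it) / iterations
        sigma = (frac ** modulation) * (sigma_init - sigma_final) + sigma_final
        costs = [cost(w) for w in colony]
        best, worst = min(costs), max(costs)
        offspring = []
        for weed, c in zip(colony, costs):
            # Lower-cost weeds produce more seeds (linear in relative fitness).
            ratio = 1.0 if worst == best else (worst - c) / (worst - best)
            n_seeds = seeds_min + int(round(ratio * (seeds_max - seeds_min)))
            for _ in range(n_seeds):
                offspring.append([min(hi, max(lo, x + rng.gauss(0.0, sigma)))
                                  for x in weed])
        # Competitive exclusion: only the max_pop fittest plants survive.
        colony = sorted(colony + offspring, key=cost)[:max_pop]
    return colony[0], cost(colony[0])


best_weed, best_cost = iwo_minimize(sphere, dim=3)
print(best_weed, best_cost)
```

Because parents compete with their offspring for survival, the best cost never gets worse from one iteration to the next. In a wrapper FS setting, the cost would typically be the classification error obtained with the features encoded by a weed.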
5.1. Naive Bayes Classifier
Table (10) shows the results of the NB classifier on the datasets. The accuracy values on Reuters-21578, WebKb, and Cade 12 are 0.7012, 0.7265, and 0.7045, respectively. The WebKb dataset has the highest accuracy.
TABLE 10 RESULTS OF THE DATASETS ACCORDING TO NB CLASSIFIER
5.2. GA Results in FS
Table (11) shows the results of the GA model on Reuters-21578 with different numbers of selected features. With 160 features, the accuracy is 0.7684.
TABLE 11 THE RESULTS OF THE GA MODEL WITH FS ON REUTERS-21578
Table (12) shows the results of the GA model on WebKb with different numbers of selected features. With 60 features, the accuracy is 0.8197.
TABLE 12 GA RESULTS WITH THE SELECTION OF DIFFERENT FEATURES ON WEBKB
Table (13) shows the results of the GA model on Cade 12 with different numbers of selected features. With 120 features, the accuracy is 0.8601.
TABLE 13 GA RESULTS WITH A SELECTION OF DIFFERENT FEATURES ON CADE 12
5.3. PSO Results in FS
Table (14) shows the results of the PSO model on Reuters-21578 with different numbers of selected features. With 20 features, the accuracy is 0.8537.
TABLE 14 RESULTS OF THE PSO MODEL BY SELECTING DIFFERENT FEATURES ON REUTERS-21578
Table (15) shows the results of the PSO model on WebKb with different numbers of selected features. With 40 features, the accuracy is 0.8920.
TABLE 15 RESULTS OF THE PSO MODEL BY SELECTING DIFFERENT FEATURES ON WEBKB
Table (16) shows the results of the PSO model on Cade 12 with different numbers of selected features. With 60 features, the accuracy is 0.8968.
TABLE 16 RESULTS OF THE PSO MODEL BY SELECTING DIFFERENT FEATURES ON CADE 12
5.4. Proposed Model without FS
Table (17) shows the results of the proposed model without FS on the datasets. The accuracy values on Reuters-21578, WebKb, and Cade 12 are 0.7625, 0.7258, and 0.7414, respectively.
TABLE 17 RESULTS OF THE DATASETS ACCORDING TO PROPOSED MODEL WITHOUT FS
5.5. Proposed Model with FS
Table (18) presents the results of the proposed model on Reuters-21578 with different numbers of selected features. With 140 features, the accuracy is 0.9687.
TABLE 18 RESULTS OF THE PROPOSED MODEL WITH VARIOUS FS ON REUTERS-21578
Table (19) presents the results of the proposed model on WebKb with different numbers of selected features. With 160 features, the accuracy is 0.9647.
TABLE 19 RESULTS OF THE PROPOSED MODEL WITH VARIOUS FS ON WEBKB
Table (20) presents the results of the proposed model on Cade 12 with different numbers of selected features. With 100 features, the accuracy is 0.9614.
TABLE 20 RESULTS OF THE PROPOSED MODEL WITH VARIOUS FS ON CADE 12
Table (21) shows the error rate of the proposed model on the datasets with different numbers of selected features.
TABLE 21 RESULTS OF THE PROPOSED MODEL WITH VARIOUS FS ACCORDING TO ERROR RATE FACTOR
Figure (5) compares the error rate on Reuters-21578 for different numbers of selected features. With 140 features, the proposed model has the lowest error rate on Reuters-21578.
Fig. 5. Comparison Diagram of Error Rate based on FS on Reuters-21578
Figure (6) compares the error rate on WebKb for different numbers of selected features. With 160 features, the proposed model has the lowest error rate on WebKb.
Fig. 6. Comparison Diagram of Error Rate based on FS on WebKb
Figure (7) compares the error rate on Cade 12 for different numbers of selected features. With 100 features, the proposed model has the lowest error rate on Cade 12.
Fig. 7. Comparison Diagram of Error Rate based on FS on Cade 12
Table (22) compares the NB classifier, the proposed model without FS, and the proposed model with FS according to the error rate.
TABLE 22 COMPARISON OF MODELS ACCORDING TO ERROR RATE
Figure (8) shows the same comparison of the NB classifier, the proposed model without FS, and the proposed model with FS as a diagram.
Fig. 8. Comparison Diagram of Models According to Error Rate
5.6. Comparison and Assessment
In this section, the results of the proposed model are compared with machine learning (ML) techniques on the datasets Reuters-21578, WebKb, and Cade 12.
5.6.1. Machine Learning Models
Table (23) compares the proposed model with different ML techniques [32]. ML techniques are often applied in TDC applications to reduce human effort and can be divided into two primary types: supervised and unsupervised. The main difference between the two types is that unsupervised ML-based TDC does not require a training process for learning how to classify text into proper categories, whereas supervised ML-based TDC needs a gold standard for training the classifier. Different algorithms have been used for supervised ML-based TDC, such as NB, KNN, and SVM.
Table (23) suggests that the proposed model has a higher accuracy than the ML techniques, that Bagging+RF has the highest F-Measure, and that Bagging+RF and RS+RF have the highest AUC. In the experimental analysis, five statistical keyword extraction methods are taken into account: most-frequent-based keyword extraction, term frequency-inverse sentence frequency (TF-ISF) based keyword extraction [32], co-occurrence statistical information (CSI) based keyword extraction [32], eccentricity-based (EB) keyword extraction [32], and TextRank (TR) algorithm based keyword extraction.
Bagging [32] is an ensemble learning method that builds a strong composite classifier with high predictive efficiency by combining classifiers trained on different training sets. In this method, each weak learning algorithm is trained on a different training set obtained by sampling with replacement from the original training set, where the size of each sample is kept equal to the size of the original training set. To obtain the new training sets, simple random sampling with replacement is used. This yields the diversity required for ensemble learning. The results of the individual classifiers are combined by majority voting or weighted majority voting. Voting [32] is the simplest form of combining the base learning algorithms. There are several ways to combine the outputs of base classification algorithms, including majority voting, weighted majority voting, the NB combination rule, the behavioral knowledge space method, and probabilistic approximation. In simple majority voting, the outputs of the k base classification algorithms are combined such that the class with the highest number of votes is taken as the output of the ensemble.
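The two building blocks described above, sampling with replacement and vote combination, can be illustrated with a minimal sketch. The function names and the example labels and weights are hypothetical; the base classifiers themselves are omitted.

```python
import random
from collections import Counter


def bootstrap_sample(training_set, rng):
    """Draw len(training_set) records with replacement (one bagging replicate)."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]


def majority_vote(predictions):
    """Return the class predicted by the largest number of base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_majority_vote(predictions, weights):
    """Return the class with the largest total weight of supporting votes."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


rng = random.Random(0)
replicate = bootstrap_sample(list(range(10)), rng)
simple = majority_vote(["c", "j", "c"])
weighted = weighted_majority_vote(["c", "j", "j"], [0.9, 0.3, 0.3])
print(replicate, simple, weighted)
```

In the weighted example, a single high-weight vote for c (0.9) outweighs two low-weight votes for j (0.6 in total), which is how weighted voting can differ from simple majority voting.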
TABLE 23 COMPARISON OF THE PROPOSED MODEL WITH ML MODELS ON REUTERS-21578
NB showed better performance than SVM and LR in the experiments, and the accuracy of the proposed model is 95.41%. The NB classifier has several strong points related to its simplicity and its need for only a small amount of training data. NB is one of the simplest techniques for constructing classifiers and is based on well-established probability theory. Despite their naive design and assumptions, NB classifiers have worked quite well in many complex real-world situations.
5.6.2. NB-K-Means Model
In Table (24), the results of the proposed model are compared with the NB-K-Means model [16] on the datasets Reuters-21578, WebKb, and Cade 12 according to accuracy. Table (24) shows that the proposed model has a higher accuracy, which is due to its selection of effective features for classification.
TABLE 24 COMPARISON OF THE PROPOSED MODEL WITH NB-K-MEANS MODEL
Table (25) shows the F-Measure scores obtained on the Reuters-21578 dataset with the SVM and NB classifiers [33]. According to Table (25), the improved global feature selection scheme (IGFSS) [33] surpasses the individual performances of three different global feature selection methods in terms of Accuracy.
TABLE 25 F-MEASURE SCORES (%) FOR REUTERS DATASET USING (A) SVM (B) NB [33]
Table (26) shows the F-Measure scores that were obtained on WebKb dataset with SVM and NB classifiers [33].
TABLE 26 F-MEASURE SCORES (%) FOR WEBKB DATASET USING (A) SVM (B) NB [33]
Information Gain (IG) scores show the contribution of the presence or absence of a term to the correct classification of text documents. IG assigns a maximum value to a term if it is a good indicator for assigning the document to any class. IG is a global FS metric because it produces only one score for any term t; this score is calculated according to Equation (14) [33].
| $IG(t) = -\sum_{i=1}^{M} P(C_i)\log P(C_i) + P(t)\sum_{i=1}^{M} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{M} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t})$ | (14) |
In Equation (14), P(Ci) is the probability of class Ci, M is the number of classes, P(t) and P(t̄) are the probabilities of presence and absence of term t, and P(Ci|t) and P(Ci|t̄) are the conditional probabilities of class Ci given presence and absence of term t, respectively. The Gini index (GI) is a global FS method for TDC that can be seen as an improved version of an attribute selection algorithm used in decision tree construction. It has a simple formulation, defined by Equation (15) [33].
| $GI(t) = \sum_{i=1}^{M} P(t \mid C_i)^2 \, P(C_i \mid t)^2$ | (15) |
In Equation (15), P(t|Ci) is the probability of term t given the presence of class Ci, and P(Ci|t) is the probability of class Ci given the presence of term t. The distinguishing feature selector (DFS) is one of the most efficient FS algorithms for TDC and is also a global FS metric. The idea behind DFS is to select distinctive features while eliminating uninformative ones according to some predetermined criteria. DFS is defined according to Equation (16) [33].
| $DFS(t) = \sum_{i=1}^{M} \frac{P(C_i \mid t)}{P(\bar{t} \mid C_i) + P(t \mid \bar{C}_i) + 1}$ | (16) |
In Equation (16), M is the number of classes, P(Ci|t) is the conditional probability of class Ci given the presence of term t, P(t̄|Ci) is the conditional probability of the absence of term t given class Ci, and P(t|C̄i) is the conditional probability of term t given all the classes except Ci.
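The three global FS metrics described above (IG, GI, and DFS) can all be estimated from per-class document counts. The sketch below uses their standard formulations from the literature; the function name, the choice of a base-2 logarithm, and the toy counts are assumptions made for illustration, and at least two non-empty classes are assumed.

```python
import math


def _plogp(p):
    """p * log2(p), with the usual convention 0 * log(0) = 0."""
    return p * math.log2(p) if p > 0 else 0.0


def fs_scores(docs_with_term, docs_per_class):
    """IG, GI and DFS of one term from per-class document counts.

    docs_with_term[i]: number of documents of class i that contain the term.
    docs_per_class[i]: total number of documents of class i (must be > 0).
    """
    n_total = sum(docs_per_class)
    n_term = sum(docs_with_term)
    p_t = n_term / n_total
    p_not_t = 1.0 - p_t
    ig = gi = dfs = 0.0
    for n_ti, n_i in zip(docs_with_term, docs_per_class):
        p_c = n_i / n_total
        p_c_given_t = n_ti / n_term if n_term else 0.0
        p_c_given_not_t = ((n_i - n_ti) / (n_total - n_term)
                           if n_total > n_term else 0.0)
        p_t_given_c = n_ti / n_i
        # Probability of the term in all classes except class i.
        p_t_given_rest = (n_term - n_ti) / (n_total - n_i)
        ig += (-_plogp(p_c) + p_t * _plogp(p_c_given_t)
               + p_not_t * _plogp(p_c_given_not_t))
        gi += p_t_given_c ** 2 * p_c_given_t ** 2
        dfs += p_c_given_t / ((1.0 - p_t_given_c) + p_t_given_rest + 1.0)
    return {"IG": ig, "GI": gi, "DFS": dfs}


# A term found in every document of class 1 and in none of class 2.
perfect = fs_scores([10, 0], [10, 10])
# A term spread evenly over both classes carries no class information.
uniform = fs_scores([5, 5], [10, 10])
print(perfect, uniform)
```

The perfectly discriminative term receives the maximum score under all three metrics in this two-class example, while the evenly spread term receives an IG of zero and low GI and DFS scores.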
IG [34] is one of the popular approaches employed as a term importance criterion for text documents. The idea is based on information theory. Before dimension reduction, the terms within the text are ranked in decreasing order of their importance for classification using the IG method. The experimental results with the KNN and C4.5 decision tree classifiers are summarized in Table (27).
TABLE 27 THE COMPARISON OF KNN AND C4.5 WITH PROPOSED MODEL BASED ON IG ON REUTERS-21,578
TABLE 28 THE COMPARISON OF KNN AND C4.5 WITH PROPOSED MODEL BASED ON IG-GA ON REUTERS-21,578
Table (28) shows the detailed comparison results. These results show that the proposed model significantly outperforms KNN and is much better than C4.5. The GA is an optimization method that mimics evolution. It is effective in wide search spaces and is therefore well suited to this problem. Although terms of high importance in documents are obtained through the IG method, the main problem is the high dimensionality of the feature space. Since the feature set U obtained via the IG method is high-dimensional, it is impractical to evaluate all possible subsets of U. Because of this deficiency, a GA-based FS method is adopted in [34], where the GA is used to provide near-optimal solutions for FS. The objective of the GA-based FS in [34] is to find the optimal subset of a given feature set U that maximizes classification performance.
A novel discriminative and semantic FS method for text categorization is proposed and explored in [35]. The method first selects features with strong discriminative power and then considers the semantic similarity between features and documents. The FS method is tested using an SVM classifier on two datasets (Reuters-21578 and 20-Newsgroups [34]). In this type of model, a document is represented as a feature vector whose components are the term weights, dk = (w1k, w2k, …, wik, …, wnk), where wik is the weight of term ti in document dk. Features are selected based on a measure of discriminative power and on measures of the similarity between features and between features and documents, independently of external information sources. All documents are then transformed into feature vectors using the selected features, and these vectors form the input data for the SVM, which is used to evaluate the usefulness of the FS method. The comparisons involve five FS methods. Three are the χ2 statistic, IG, and mutual information (MI); the other two are incorporated in the proposed method, i.e., the discriminative feature selection method (DFS) and the discriminative and semantic FS method (DFS+Similarity).
TABLE 29 PERFORMANCE COMPARISON WITH DIFFERENT NUMBER OF FEATURES ON (A) REUTERS-21578 AND (B) 20 NEWSGROUPS [35]
The Chi-square (χ2) statistic: In this method, features are selected according to their correlation with a category. The χ2 statistic measures the lack of independence between a feature t and a category c and can be compared to the χ2 distribution with one degree of freedom to judge extremeness. The statistic is defined in Equation (17) [35].
| $\chi^2(t_i, c_j) = \frac{N\,(a_{ij} d_{ij} - b_{ij} c_{ij})^2}{(a_{ij} + b_{ij})(c_{ij} + d_{ij})(a_{ij} + c_{ij})(b_{ij} + d_{ij})}$ | (17) |
In Equation (17), N is the total number of documents, aij is the number of documents that contain feature ti in category cj, bij is the number of documents that do not contain feature ti in category cj, cij is the number of documents that contain feature ti but do not belong to category cj, and dij is the number of documents that do not contain feature ti and do not belong to category cj. MI is a metric of the correlation between signals, and can be used to identify the features relevant to a particular category, as in Equation (18) [35].
| (18) |
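The χ2 statistic of Equation (17) depends only on the four document counts defined above. As a small sketch, assuming the standard contingency-table form of the statistic (the function name and the example counts are hypothetical):

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a feature/category pair from document counts.

    a: documents in the category that contain the feature
    b: documents in the category that do not contain the feature
    c: documents outside the category that contain the feature
    d: documents outside the category that do not contain the feature
    """
    n = a + b + c + d  # total number of documents N
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: feature or category never varies
    return n * (a * d - b * c) ** 2 / denominator


# Feature present in exactly the documents of the category.
dependent = chi_square(10, 0, 0, 10)
# Feature spread independently of the category.
independent = chi_square(5, 5, 5, 5)
print(dependent, independent)
```

A feature that occurs independently of the category scores zero, and the score grows as the feature becomes more closely tied to the category.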
DFS: The main objectives of the DFS method are (i) selecting features with a higher average term frequency in cj, because these features have a high probability of representing category cj; (ii) selecting features with a higher occurrence rate in most of the documents in cj, because these features have a high probability of representing category cj; and (iii) ignoring features occurring in most of the documents in both cj and c̄j, because these features have a weak discriminative ability between categories.
| (19) |
In Equation (19), the term-frequency terms represent the term frequency of feature ti in category cj and in category c̄j, respectively, and the document-frequency terms represent the number of documents containing feature ti in category cj and in category c̄j, respectively.
TF-IDF: TF-IDF [35] is the most popular term weighting scheme in information retrieval.
| (20) |
In Equation (20), n is the number of chosen features, tf (tm,dk) is the term frequency of feature tm in document dk, and nm is the number of documents that contain feature tm.
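As an illustration of TF-IDF weighting over n chosen features, the sketch below assumes the widely used cosine-normalized form, in which the raw weight of feature tm in document dk is tf(tm, dk) · log(N/nm) and the vector is then scaled to unit length. This form, the natural logarithm, the symbol N for the total number of documents, and the function name are assumptions; the exact form of Equation (20) may differ.

```python
import math


def tfidf_vector(term_freqs, doc_freqs, n_docs):
    """Cosine-normalized TF-IDF weights for the chosen features of one document.

    term_freqs[m]: term frequency of feature m in the document
    doc_freqs[m]:  number of documents that contain feature m
    n_docs:        total number of documents in the collection
    """
    raw = [tf * math.log(n_docs / df) if df else 0.0
           for tf, df in zip(term_freqs, doc_freqs)]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw] if norm else raw


# Toy counts: the third feature occurs in every document, so its weight is 0.
weights = tfidf_vector([2, 0, 1], [10, 5, 100], n_docs=100)
print(weights)
```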
VI. Conclusion and Future Works
In this paper, we used a hybrid of the IWO algorithm and the NB classifier for TDC. We used IWO for selecting important features and NB for document classification based on training and testing. The results indicate that the proposed model is more accurate than the NB classifier. In addition, the error rate of the proposed model is lower with FS than without it. Comparison with other models indicates that the proposed model is more accurate because it uses FS and is able to explore the feature space better. The error rate of the proposed model with FS on the datasets Reuters-21578, WebKb, and Cade 12 is 0.0313, 0.0353, and 0.0386, respectively. In future studies, the proposed model could be enhanced by using a hybrid of the operators of metaheuristic algorithms for selecting the optimum solution.
References
[1] W. Hadi, Q.A. Al-Radaideh, S. Alhawari, Integrating associative rule-based classification with Naïve Bayes for text classification, Applied Soft Computing, Vol. 69, pp. 344-356, 2018.
[2] D. Mahata, R.R. Shah, J. Kuriakose, R. Zimmermann, J.R. Talburt, Theme-Weighted Ranking of Keywords from Text Documents Using Phrase Embeddings, IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), IEEE, pp. 184-189, 2018.
[3] A. Kulkarni, V. Tokekar, P. Kulkarni, Discovering Context of Labeled Text Documents Using Context Similarity Coefficient, Procedia Computer Science, Vol. 49, pp. 118-127, 2015.
[4] K. Chen, Z. Zhang, J. Long, H. Zhang, Turning from TF-IDF to TF-IGM for term weighting in text classification, Expert Systems with Applications, Vol. 66, pp. 245-260, 2016.
[5] S. Ramanna, J.F. Peters, C. Sengoz, Application of Tolerance Rough Sets in Structured and Unstructured Text Categorization: A Survey, Thriving Rough Sets, Springer, Vol. 708, pp. 119-138, 2017.
[6] A.R. Mehrabian, C. Lucas, A novel numerical optimization algorithm inspired from weed colonization, Ecological Informatics, Vol. 1, Issue 4, pp. 355-366, 2006.
[7] A. McCallum, K. Nigam, A Comparison of Event Models for Naive Bayes Text Classification, In AAAI-98 workshop on learning for text categorization, Vol. 752, pp. 41-48, 1998.
[8] X. Deng, Y. Li, J. Weng, J. Zhang, Feature selection for text classification: A review, Multimedia Tools and Applications, pp. 1-20, 2018.
[9] M. Rogati, Y. Yang, High-performing variable selection for text classification, in: CIKM ’02 Proceedings of the 11th International Conference on Information and Knowledge Management, pp. 659-661, 2002.
[10] Y. Yang, J.O. Pedersen, A comparative study on feature selection in text categorization, in: The Fourteenth International Conference on Machine Learning (ICML97), pp. 412-420, 1997.
[11] J. Holland, Adaptation in Natural and Artificial Systems, University of Michigan, Michigan, USA, 1975.
[12] J. Kennedy, R. C. Eberhart, Particle Swarm Optimization, In Proceedings of the IEEE International Conference on Neural Networks, pp. 1942-1948, 1995.
[13] A. Trstenjak, S. Mikac, D. Donko, KNN with TF-IDF based Framework for Text Categorization, Procedia Engineering, Vol. 69, pp. 1356-1364, 2014.
[14] Y. Ko, J. Seo, Text classification from unlabeled documents with bootstrapping and feature projection techniques, Information Processing & Management, Vol. 45, Issue 1, pp. 70-83, 2009.
[15] D. Ghasempour, F.S.Gharehchopogh, A New Approach for Feature Selection in Text Documents Classification by Using Hybrid Model of Bat and K-Nearest Neighborhood Algorithms, Islamic Azad University, Urmia Branch, Thesis, Summer 2016.
[16] A. Allahvirdipour, F.S. Gharehchopogh, New Approach in Features Selection in Text Documents Classification using the Hybrid Model Algorithms of Naive Bayes and K-Means, Islamic Azad University, Urmia Branch, Thesis, Spring 2016.
[17] R. Habibpour, K. Khalilpour, A New Hybrid K-means and K-Nearest-Neighbor Algorithms for Text Document Clustering, International Journal of Academic Research, Vol. 6, Issue 3, pp. 79-84, 2014.
[18] M. Karabulut, Fuzzy unordered rule induction algorithm in text categorization on top of geometric particle swarm optimization term selection, Knowledge-Based Systems, Vol. 54, pp. 288-297, 2013.
[19] A.K. Uysal, S. Gunal, Text classification using genetic algorithm oriented latent semantic features, Expert Systems with Applications, Vol. 41, Issue 13, pp. 5938-5947, 2014.
[20] T. Wei, Y. Lu, H. Chang, Q. Zhou, X. Bao, A semantic approach for text clustering using WordNet and lexical chains, Expert Systems with Applications, Vol. 42, Issue 4, pp. 2264-2275, 2015.
[21] W. Zhang, X. Tang, T. Yoshida, TESC: An approach to TExt classification using Semi-Supervised Clustering, Knowledge-Based Systems, Vol. 75, pp. 152-160, 2015.
[22] K.K. Bharti, P.K. Singh, Opposition chaotic fitness mutation based adaptive inertia weight BPSO for feature selection in text clustering, Applied Soft Computing, Vol. 43, pp. 20-34, 2016.
[23] D. AbuZeina, F.S. Al-Anzi, Employing fisher discriminant analysis for Arabic text classification, Computers & Electrical Engineering, in press, corrected proof, Available online 10 November 2017.
[24] R. Wongso, F.A. Luwinda, B.C. Trisnajaya, O. Rusli, Rudy, News Article Text Classification in Indonesian Language, Procedia Computer Science, Vol. 116, pp. 137-143, 2017.
[25] H.P. Luhn, A Statistical Approach to the Mechanized Encoding and Searching of Literary Information, IBM Journal of Research and Development, Vol. 1, No. 4, pp. 309-317, 1957.
[26] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, 1989.
[27] R.S. Michalski, I. Bratko, M. Kubat, Machine Learning and Data Mining: Methods and Applications, New York: Wiley, 1998.
[28] D. Francois, Binary classification performances measure cheat sheet, 2009.
[29] C. Blake, C.J. Merz, UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html], University of California, Department of Information and Computer Science, Irvine, CA, 1998, pp. 55.
[30] http://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
[31] http://ana.cachopo.org/datasets-for-single-label-text-categorization
[32] A. Onan, S. Korukoglu, H. Bulut, Ensemble of keyword extraction methods and classifiers in text classification, Expert Systems with Applications, Vol. 57, pp. 232-247, 2016.
[33] A.K. Uysal, An improved global feature selection scheme for text classification, Expert Systems with Applications, Vol. 43, pp. 82-92, 2016.
[34] H. Uguz, A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm, Knowledge-Based Systems, Vol. 24, Issue 7, pp. 1024-1032, 2011.
[35] W. Zong, F. Wu, L.K. Chu, D. Sculli, A Discriminative and Semantic Feature Selection Method for Text Categorization, International Journal of Production Economics, Vol. 165, pp. 215-222, 2015.
[36] C. Veenhuis, Binary Invasive Weed Optimization, Second World Congress on Nature and Biologically Inspired Computing (NaBIC), pp. 449-454, 2010.
[37] L.M. Abualigah, A.T. Khader, Unsupervised text feature selection technique based on hybrid particle swarm optimization algorithm with genetic operators for the text clustering, The Journal of Supercomputing, Vol. 73, Issue 11, pp. 4773-4795, 2017.