Using an Automatic Weighted Keywords Dictionary for Intelligent Web Content Filtering
Subject Areas : B. Computer Systems Organization
Najibeh Farzi Veijouyeh 1 * , Jamshid Bagherzadeh 2
1 - Islamic Azad University of Shabestar Branch, Shabestar, Iran
2 - Assistant Professor, Computer Science and Eng. Dept., Urmia University, Urmia, Iran
Keywords: Forbidden keywords extraction, Ranking keywords, Web page representation, Content based filtering
Abstract :
Filtering web pages with inappropriate content is a major issue in intelligent network security. Any country that wants to control users' access to the web needs an intelligent filtering method that is both accurate and fast, and the problem has therefore attracted many researchers. Representing web pages in a machine-understandable form is one of the most important preprocessing steps, so a lower-dimensional description of web pages would be very effective, especially for deciding whether a page should be filtered out. In this paper, we propose an automatic method for detecting forbidden keywords in web pages. We then define a new vector representation of web pages, named RWSF, which consists of the weighted sum and frequency of forbidden keywords in different parts of a web page. For this purpose, a ranking dictionary of keywords, including forbidden keywords, is used. To evaluate the proposed method, 2643 pages were used, consisting of 1311 normal pages and 1332 forbidden pages. Of these, 1851 pages were used to train the system and 792 pages to evaluate it. The system was assessed with several classifiers: k-Nearest Neighbor, Support Vector Machines, Decision Tree, and Artificial Neural Networks. The evaluation results indicate high efficiency and accuracy for the proposed method with all classifiers.
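As an illustration of the kind of representation the abstract describes, the following is a minimal sketch of a vector holding, for each part of a page, the weighted sum and the frequency of forbidden keywords. It is not the authors' implementation: the dictionary entries, their weights, the choice of page parts (`title`, `meta`, `body`), the tokenizer, and the function name `rwsf_vector` are all assumptions made for the example.

```python
import re
from typing import Dict, List

# Hypothetical ranking dictionary: keyword -> weight.
# Both the keywords and the weights are invented for illustration.
RANKED_KEYWORDS: Dict[str, float] = {"casino": 1.0, "poker": 0.75, "bet": 0.5}

# Hypothetical page parts; the abstract does not say which parts are used.
PARTS: List[str] = ["title", "meta", "body"]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rwsf_vector(page: Dict[str, str],
                weights: Dict[str, float] = RANKED_KEYWORDS) -> List[float]:
    """Return [weighted_sum, frequency] for each part, concatenated in PARTS order."""
    vector: List[float] = []
    for part in PARTS:
        tokens = tokenize(page.get(part, ""))
        hits = [t for t in tokens if t in weights]
        vector.append(float(sum(weights[t] for t in hits)))  # weighted sum of hits
        vector.append(float(len(hits)))                      # raw frequency of hits
    return vector


page = {
    "title": "Online casino and poker",
    "meta": "",
    "body": "Place a bet, then bet again at the casino.",
}
print(rwsf_vector(page))  # [1.75, 2.0, 0.0, 0.0, 2.0, 3.0]
```

With three parts the vector has six dimensions regardless of page length, which is the sense in which such a representation is low-dimensional; a vector of this shape could then be passed to any of the classifiers named in the abstract.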