Document Analysis And Classification Based On Passing Window

BAMASOOD, ZAHER

Manuscript ID: JACET-1803-1152 (R3) Pages: 39-46

Article Type: Original Research

Subject Areas : Data Mining

ZAHER BAMASOOD 1*

1 - Computer Science Department, Hadhramout University

Received: 2018-03-04 Accepted : 2019-08-01 Published : 2020-02-01

Keywords: Feature Extraction, Segmentation, Data Mining, Information Retrieval, Document Image Analysis

Abstract :

In this paper we present a document analysis and classification system that segments and classifies the contents of Arabic document images. The system has four stages: preprocessing, document segmentation, feature extraction, and document classification. In preprocessing, the document image is enhanced by removing noise, binarizing the image, and detecting and correcting skew. In document segmentation, a proposed algorithm divides the document image into homogeneous regions. In document classification, a neural network classifier (a multilayer perceptron trained with backpropagation) labels each region as text or non-text, based on a number of features computed in the feature extraction stage. These features are collected from the works of other researchers. Experiments were conducted on 398 document images selected randomly from the Printed Arabic Text Database (PATDB), which covers various printed forms: advertisements, book chapters, magazines, newspapers, letters, and reports. The proposed segmentation algorithm yielded an overlap ratio of only 0.814% (the overlapping area of merged zones relative to the total size of zones) and a missed-area ratio of 1.938% (the missed area relative to the total size of zones). The features that individually gave the best accuracy were the Background Vertical Run Length (RL) Mean and the Standard Deviation of the foreground.
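
As an illustration of one of the features named in the abstract, the sketch below computes a mean background vertical run length for a binarized region. It is a minimal example under stated assumptions, not the paper's implementation: the region is assumed to be a list of pixel rows with 1 for foreground (ink) and 0 for background, and the function names `vertical_background_runs` and `background_vertical_rl_mean` are hypothetical.

```python
from statistics import mean


def vertical_background_runs(region):
    """Lengths of consecutive background (0) pixel runs, scanned column by column.

    Assumption: `region` is a list of equal-length rows, 1 = foreground, 0 = background.
    """
    runs = []
    if not region:
        return runs
    height, width = len(region), len(region[0])
    for x in range(width):
        run = 0
        for y in range(height):
            if region[y][x] == 0:
                run += 1          # extend the current background run
            elif run:
                runs.append(run)  # a foreground pixel closes the run
                run = 0
        if run:
            runs.append(run)      # run that reaches the bottom edge
    return runs


def background_vertical_rl_mean(region):
    """Mean background vertical run length; 0.0 if the region has no background."""
    runs = vertical_background_runs(region)
    return mean(runs) if runs else 0.0


# Toy 4x3 region (hypothetical data, for illustration only).
region = [
    [0, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 0],
]
print(background_vertical_rl_mean(region))
```

Text regions tend to have many short background runs between strokes and lines, while pictures and large blank areas produce longer ones, which is one plausible reason a feature of this kind can help separate text from non-text.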

