A new approach for data cleaning to improve quality of data warehouse
Subject Areas : Multimedia Processing, Communications Systems, Intelligent Systems
Ali Shahnavaz 1, Mehdi Afzali 2, Shima Rahimzadeh 3
1 - Department of Mathematics and Statistics, Islamic Azad University, Zanjan Branch, Zanjan, Iran
2 - Department of IT Engineering, Islamic Azad University, Zanjan Branch, Zanjan, Iran
3 - Zanjan University of Medical Sciences, Zanjan, Iran
Keywords: Data Quality, Data Management, Data Cleaning, Data Preparation, Data Warehouse
Abstract :
Data management provides the tools by which an organization's information needs can be properly met. The most important issue in business intelligence is data quality. Data quality can be ensured by cleaning the data before it is loaded into the data warehouse. Data cleaning is the process of detecting and correcting errors and inconsistencies in the data warehouse. Because databases hold huge volumes of data, many problems and contradictions emerge. The main goal of this study is to remove inconsistencies from databases in order to clean up dirty data. A new approach is proposed that aims to improve the quality of the data warehouse so that correct decisions can be made. To test the proposed approach, a data set of student health certificates was used. By implementing this approach, we were able to detect dirty data and then correct it using the students' national codes. Based on the results achieved, the proportion of dirty data decreased from 25.79% to 4.97%.
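The detect-then-correct idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the field names, the `REFERENCE` lookup table, the sample records, and the rule for what counts as "dirty" are all assumptions made for illustration. The sketch only shows how a unique key such as a national code could be used to look up trusted values and how a dirty-data rate could be measured before and after correction.

```python
# Hypothetical trusted lookup keyed by national code (invented values).
REFERENCE = {
    "0012345678": {"name": "Sara Ahmadi", "birth_year": 1995},
    "0087654321": {"name": "Reza Karimi", "birth_year": 1994},
}

# Fields checked for quality (an assumption; the paper's fields are not listed).
FIELDS = ("name", "birth_year")


def is_dirty(record, reference):
    """Return True if a field is missing/blank or contradicts the trusted entry."""
    if any(record.get(f) in (None, "") for f in FIELDS):
        return True
    trusted = reference.get(record.get("national_code"))
    return trusted is not None and any(record[f] != trusted[f] for f in FIELDS)


def clean(record, reference):
    """Overwrite checked fields with trusted values when the national code is known."""
    trusted = reference.get(record.get("national_code"))
    if trusted is None:
        return dict(record)  # no trusted entry: the record cannot be corrected
    return {**record, **{f: trusted[f] for f in FIELDS}}


def dirty_rate(records, reference):
    """Percentage of records flagged as dirty."""
    return 100.0 * sum(is_dirty(r, reference) for r in records) / len(records)


records = [
    {"national_code": "0012345678", "name": "Sara Ahmadi", "birth_year": 1995},  # clean
    {"national_code": "0012345678", "name": "Sara Ahmady", "birth_year": 1995},  # misspelled name
    {"national_code": "0087654321", "name": "", "birth_year": 1994},             # blank name
    {"national_code": "0099999999", "name": "Ali Moradi", "birth_year": None},   # unknown code
]

cleaned = [clean(r, REFERENCE) for r in records]
print(dirty_rate(records, REFERENCE))  # 75.0
print(dirty_rate(cleaned, REFERENCE))  # 25.0
```

In this toy data the last record stays dirty because its national code has no trusted entry, which mirrors the fact that a key-based correction step can reduce the dirty-data rate without necessarily bringing it to zero.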