Application of Framework for Data Cleaning to Handle Noisy Data in Data Warehouse
A.F. Elgamal*1, N.A. Mosa2, N.A. Amasha3
1A.F. Elgamal, Ass. Professor, Department of Computer Science, Mansoura University, Egypt.
2N.A. Mosa, Lecturer, Department of Computer Science, Mansoura University, Egypt.
3N.A. Amasha, Instructor, Department of Computer Science, Mansoura University, Egypt.
Manuscript received on December 08, 2013. | Revised Manuscript received on December 15, 2013. | Manuscript published on January 05, 2014. | PP: 226-231 | Volume-3 Issue-6, January 2014. | Retrieval Number: F2029013614/2014©BEIESP
© The Authors. Published By: Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Data cleaning is a complex process that draws on several specialized techniques to resolve the contradictions that arise when data are taken from different sources. It is a real challenge for most organizations that need to improve the quality of their data. Data quality in data stores degrades when input data contain errors or abbreviations, or when records derived from several databases are combined into a single source. Data cleaning, and in particular the elimination of duplicate records, is therefore one of the most challenging stages: it covers the detection and removal of errors, filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies, all in order to improve the quality of data gathered from distributed sources. Clean data is especially crucial for drawing correct conclusions in decision support systems (DSS). This paper presents an application of a general framework for the data cleaning process, consisting of six steps: selection of attributes, formation of tokens, selection of the clustering algorithm, similarity computation for the selected attributes, selection of the elimination function, and finally merging. The proposed software was developed with SQL Server 2010 and C# 2010.
Keywords: Data cleaning, Data quality, Data warehouse, Duplicate elimination.
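As a rough illustration only (not the authors' implementation), the C# sketch below walks through the six steps named in the abstract: attribute selection, token formation, clustering, similarity computation, duplicate elimination, and merging. The sample records, the Jaccard token similarity, the 0.5 matching threshold, and the representative-based clustering are all illustrative assumptions.

```csharp
// A minimal sketch of a token-based duplicate-elimination pass, assuming
// Jaccard similarity, a 0.5 threshold, and representative-based clustering.
using System;
using System.Collections.Generic;
using System.Linq;

class DuplicateCleaner
{
    // Step 2: form tokens from the selected attributes of a record.
    static string[] Tokenize(Dictionary<string, string> record, string[] attributes) =>
        attributes
            .SelectMany(a => record[a].ToLower()
                .Split(new[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries))
            .ToArray();

    // Step 4: Jaccard similarity of two token sets (shared tokens / all tokens).
    static double Similarity(string[] a, string[] b)
    {
        var sa = new HashSet<string>(a);
        var sb = new HashSet<string>(b);
        int shared = sa.Intersect(sb).Count();
        int union = sa.Count + sb.Count - shared;
        return union == 0 ? 1.0 : (double)shared / union;
    }

    static void Main()
    {
        // Sample dirty records; in practice these come from the source databases.
        var records = new List<Dictionary<string, string>>
        {
            new Dictionary<string, string> { { "Name", "John A. Smith" }, { "Address", "12 High St, Cairo" } },
            new Dictionary<string, string> { { "Name", "Smith, John" },   { "Address", "12 High Street Cairo" } },
            new Dictionary<string, string> { { "Name", "Mona Ali" },      { "Address", "5 Nile Rd, Mansoura" } }
        };

        // Step 1: attributes selected for matching (chosen here for illustration).
        string[] keyAttributes = { "Name", "Address" };

        // Step 3: assign each record to the first cluster whose representative
        // (first member) is similar enough; otherwise start a new cluster.
        var clusters = new List<List<Dictionary<string, string>>>();
        foreach (var rec in records)
        {
            var tokens = Tokenize(rec, keyAttributes);
            var home = clusters.FirstOrDefault(c =>
                Similarity(tokens, Tokenize(c[0], keyAttributes)) >= 0.5);
            if (home != null) home.Add(rec);
            else clusters.Add(new List<Dictionary<string, string>> { rec });
        }

        // Steps 5-6: eliminate duplicates and merge - keep one representative
        // record per cluster as the cleaned output.
        var cleaned = clusters.Select(c => c[0]).ToList();

        foreach (var rec in cleaned)
            Console.WriteLine(rec["Name"] + " | " + rec["Address"]);
    }
}
```

In this toy run the two spellings of the same customer fall into one cluster and only one of them is kept; a realistic system would restrict comparisons with a blocking or sorted-neighborhood strategy and use a more careful merge rule than simply keeping the first record of each cluster.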