Data Pre-Processing for Classification and ClusteringAuthor : S.Balamurugan and A.B.Arockia Christopher
Volume 1 No.1 January-June 2012 pp 1-5.
In real world datasets, lots of redundant and conflicting data exists. The performance of a classification algorithm in data mining is greatly affected by noisy information (i.e. redundant and conflicting data). These parameters not only increase the cost of mining process, but also degrade the detection performance of the classifiers. They have to be removed to increase the efficiency and accuracy of the classifiers. This process is called as the tuning of the dataset. The redundancy check will be performed on the original dataset and the resultant is to be preserved. This resultant dataset is to be then checked for conflicting data and if they will be corrected and updated to the original dataset. This updated dataset is to be then classified using a variety of classifiers like Multilayer perceptron, SVM, Decision stump, Kstar, LWL, Rep tree, Decision table, ID3, J48 and NaÃ¯ve Bayes. The performance of the updated datasets on these classifiers is to be found. The results will show a significant improvement in the classification accuracy when redundancy and conflicts are to be removed. The conflicts after correction are to be updated to the original dataset, and when the performance of the classifier is to be evaluated, great improvement is to be witnessed.
data mining, classification algorithm, redundancy, conflicting data