Imputation of Missing Values using Hybrid Approach with Association Rule and KMean Clustering Algorithm
Authors: Neelesh Shrival, Kapil Sahu
Abstract
Data mining relies on facts and figures that support decision making. For any analysis to be reliable, these facts must be complete so that the analyst can form a sound strategy. In practice, one of the most important problems in knowledge discovery is missing attribute values in the dataset. Such imperfections usually require a preprocessing stage in which the data are prepared and cleaned so that they are usable and sufficiently clear for the knowledge extraction process. This thesis presents a comparative study of the different methods employed for imputation, or replacement, of missing values. These methods can work with text, Boolean, and numeric datasets. We discuss parametric, non-parametric, and semiparametric imputation methods.
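To make the idea of imputation concrete, the following is a minimal sketch of two common baselines on a numeric dataset: column-mean imputation and a nearest-neighbour (non-parametric) imputation. It is an illustration only, not the method proposed in this work. The function names `mean_impute` and `knn_impute`, the use of `None` to mark a missing value, and the toy `data` table are all assumptions made for this example, and the sketch assumes every column has at least one observed value.

```python
from math import sqrt
from statistics import mean


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(mean(observed))  # assumes the column is not entirely missing
    return [[col_means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]


def knn_impute(rows, k=2):
    """Replace each None with the column mean over the k nearest donor rows.

    Distance is a root-mean-square difference over the columns that are
    observed in both rows, so rows with missing entries can still be compared.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return float("inf")  # nothing in common: treat as infinitely far
        return sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, value in enumerate(row):
            if value is None:
                # Candidate donors: other rows that have column j observed.
                donors = [(distance(row, other), other[j])
                          for m, other in enumerate(rows)
                          if m != i and other[j] is not None]
                donors.sort(key=lambda d: d[0])
                new_row[j] = mean(val for _, val in donors[:k])
        result.append(new_row)
    return result


# Hypothetical toy dataset; None marks a missing value.
data = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.1],
    [9.0, 8.0, 7.0],
    [0.9, 2.2, 2.9],
]

print(mean_impute(data)[1])
print(knn_impute(data, k=2)[1])
```

On this toy table the column mean is pulled upward by the outlying third row, whereas the nearest-neighbour estimate draws only on the two rows most similar to the incomplete one. This is the kind of difference a comparative study of imputation methods would examine.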
Introduction
With access to vast volumes of data, decision makers frequently draw conclusions from data repositories that may, for a variety of reasons, contain data quality problems. Data quality is therefore a serious concern in decision making. Data quality issues arise from the nature of the information supply chain [1], where the consumer of a data product may be several supply-chain steps removed from the people or groups who gathered the original datasets on which the data product is based. These consumers use data products to make decisions, often with implications for financial and time budgets. The separation of the data consumer from the data producer creates a situation where the consumer has little or no idea about the level of quality of the data [2], which can lead to poor decision making and poorly allocated time and financial resources.
Conclusion
In this thesis we have investigated different techniques for missing value imputation and dimensionality reduction. We sought to understand and identify suitable techniques for developing a model that analyzes the impact of missing instances in a dataset. A key factor is understanding the nature of the dataset in order to choose a suitable technique. The outcomes of this study should help in choosing appropriate techniques for missing data handling problems. Our results suggest that missing value imputation using our technique has good potential in terms of accuracy and also performs well in terms of processing time. In future work, we plan to extend this approach by combining several methods, which we expect to yield better results.
Copyright
Copyright © 2025 Neelesh Shrival, Kapil Sahu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.