Using data mining techniques to discover systematic biases in missing data

Conference paper

Tremblay, M. C., Dutta, K., VanderMeer, D. (2008). Using data mining techniques to discover systematic biases in missing data. pp. 133-138.

Business intelligence tools, including data warehousing and OLAP, aid decision-makers by facilitating the exploration of data aggregated from multiple sources. In this context, missing data is a well-known and important problem, since it can seriously affect the accuracy of the conclusions drawn. Researchers have described several approaches for dealing with missing data, primarily by attempting to infer the missing values or to estimate the impact of missing data on conclusions. However, few have considered approaches that characterize bias in missing data, i.e., that determine the specific attributes that predict the missingness of data values. Knowledge of the specific systematic bias patterns in the missing data can help analysts more accurately assess the quality of conclusions drawn from data sets with missing data. This research proposes a methodology that combines a number of Knowledge Discovery and Data Mining techniques, including association rule mining, to discover patterns in related attribute values that help characterize these biases. We demonstrate the efficacy of the proposed approach by applying it to a demonstration census data set seeded with biased missing data. The experimental results show that the approach found seeded biases and filtered out most seeded noise.
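The idea of characterizing missingness with association rules can be illustrated with a minimal sketch. It is not the authors' methodology, which combines several techniques; it only mines single-antecedent rules of the form (attribute = value) -> (target is missing) and ranks them by lift. The attribute names (`region`, `education`, `income`), the seeding probabilities, the thresholds, and the helper names `make_demo_rows`, `seed_missing`, and `missingness_rules` are all assumptions made for illustration, and the data is synthetic rather than the census data set used in the paper.

```python
import random
from collections import Counter


def make_demo_rows(n, rng):
    """Build synthetic records (hypothetical attributes, not real census data)."""
    regions = ["north", "south", "east", "west"]
    education = ["hs", "college", "grad"]
    return [
        {
            "region": rng.choice(regions),
            "education": rng.choice(education),
            "income": rng.randint(20, 120),
        }
        for _ in range(n)
    ]


def seed_missing(rows, target, biased_attr, biased_value, p_biased, p_noise, rng):
    """Blank out `target` with probability p_biased where the bias applies,
    and with probability p_noise (random noise) everywhere else."""
    seeded = []
    for row in rows:
        row = dict(row)  # copy so the input rows stay untouched
        p = p_biased if row[biased_attr] == biased_value else p_noise
        if rng.random() < p:
            row[target] = None
        seeded.append(row)
    return seeded


def missingness_rules(rows, target, min_support=0.02, min_lift=1.5):
    """Mine rules (attr = value) -> (target is missing), sorted by lift."""
    n = len(rows)
    n_missing = sum(1 for r in rows if r[target] is None)
    if n == 0 or n_missing == 0:
        return []
    base_rate = n_missing / n  # overall probability that target is missing

    antecedent = Counter()  # rows containing (attr, value)
    joint = Counter()       # rows containing (attr, value) AND missing target
    for r in rows:
        for attr, value in r.items():
            if attr == target or value is None:
                continue
            antecedent[(attr, value)] += 1
            if r[target] is None:
                joint[(attr, value)] += 1

    rules = []
    for (attr, value), both in joint.items():
        support = both / n
        confidence = both / antecedent[(attr, value)]
        lift = confidence / base_rate
        if support >= min_support and lift >= min_lift:
            rules.append(
                {"attr": attr, "value": value, "support": support,
                 "confidence": confidence, "lift": lift}
            )
    return sorted(rules, key=lambda r: r["lift"], reverse=True)


def demo():
    """Seed one bias plus background noise, then try to recover the bias."""
    rng = random.Random(0)
    rows = make_demo_rows(5000, rng)
    seeded = seed_missing(rows, "income", "region", "south", 0.6, 0.05, rng)
    return missingness_rules(seeded, "income")
```

Calling `demo()` returns the rules that pass both thresholds. A lift well above 1 for a rule means the target is missing far more often under that attribute value than in the data set overall, which is the kind of systematic bias the abstract describes. The lift threshold plays the role of a noise filter here: attribute values whose missingness rate is close to the overall rate are dropped.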