Probabilistic Noise Identification and Data Cleaning
by Jeremy Kubica and Andrew Moore
BibTeX:
@InProceedings{kubicaLENS,
author = "Jeremy Kubica and Andrew Moore",
title = "Probabilistic Noise Identification and Data Cleaning",
booktitle = "The Third IEEE International Conference on Data Mining",
month = "November",
year = "2003",
pages = "131--138",
editor = "Xindong Wu and Alex Tuzhilin and Jude Shavlik",
publisher = "IEEE Computer Society"
}
Abstract:
Real-world data is never as perfect as we would like it to be and can often suffer from
corruptions that may impact interpretations of the data, models created from the data,
and decisions made based on the data. One approach to this problem is to identify and
remove records that contain corruptions. Unfortunately, if only certain fields in a record
have been corrupted, then usable, uncorrupted data will be lost. In this paper we present
LENS, an approach for identifying corrupted fields and using the remaining non-corrupted
fields for subsequent modeling and analysis. Our approach uses the data to learn a
probabilistic model containing three components: a generative model of the clean records,
a generative model of the noise values, and a probabilistic model of the corruption process.
We provide an algorithm for the unsupervised discovery of such models and empirically evaluate
both its performance at detecting corrupted fields and, as one example application, the
resulting improvement this gives to a classifier.
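
Illustrative sketch (not from the paper):
The three components named in the abstract can be illustrated with a small Python sketch. This is not the LENS algorithm, and it does not learn anything from data. It assumes, purely for illustration, that each field is independent, that clean values follow a Gaussian, that noise values are uniform over a known range, and that each field is corrupted with a fixed prior probability. All function names, parameters, and numeric values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of the assumed clean-value model for one field."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def uniform_pdf(x, low, high):
    """Density of the assumed noise-value model for one field."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def corruption_posterior(x, mean, std, low, high, prior):
    """Posterior probability that an observed field value is corrupted.

    Applies Bayes' rule to a two-part mixture: with probability `prior`
    the value comes from the noise model, otherwise from the clean model.
    """
    clean = (1.0 - prior) * gaussian_pdf(x, mean, std)
    noisy = prior * uniform_pdf(x, low, high)
    total = clean + noisy
    return noisy / total if total > 0.0 else 0.0


def flag_corrupted_fields(record, field_models, prior=0.05, threshold=0.5):
    """Return one boolean per field: True if the field looks corrupted."""
    return [
        corruption_posterior(x, m["mean"], m["std"], m["low"], m["high"], prior)
        > threshold
        for x, m in zip(record, field_models)
    ]


# Hypothetical per-field parameters; a real system would learn these.
models = [{"mean": 0.0, "std": 1.0, "low": -100.0, "high": 100.0}] * 2
flags = flag_corrupted_fields([0.2, 50.0], models)
```

Here only the second field is flagged, so the first field of the record remains available for later modeling instead of the whole record being discarded.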