ur-CAIM: Improved CAIM Discretization for Unbalanced and Balanced Data

Author
Cano, Alberto
Nguyen, Dat T.
Ventura Soto, S.
Cios, Krzysztof J.
Date
2015-10-15
Subject
Supervised discretization
Class-attribute interdependency maximization
Unbalanced data
Classification
Abstract
Supervised discretization is one of the basic data preprocessing
techniques used in data mining. CAIM (Class-Attribute
Interdependence Maximization) is a discretization
algorithm for data whose classes are known. However,
newly arising challenges, such as the presence of unbalanced
data sets, call for new algorithms capable of handling them
in addition to balanced data. This paper presents a new discretization
algorithm named ur-CAIM, which improves on
the CAIM algorithm in three important ways. First, it generates
more flexible discretization schemes while producing
a small number of intervals. Second, the quality of the intervals
is improved based on the distribution of the data classes,
which leads to better classification performance on balanced
and, especially, unbalanced data. Third, the runtime of the
algorithm is lower than CAIM’s. The algorithm has been
designed to be parameter-free, and it self-adapts to the problem
complexity and the data class distribution. ur-CAIM
was compared with 9 well-known discretization methods
on 28 balanced and 70 unbalanced data sets. The results
obtained were validated with non-parametric statistical
tests, which show that our proposal outperforms CAIM and
many of the other methods on both types of data, but especially
on unbalanced data, which is a significant advantage.
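
For orientation, the criterion maximized by the original CAIM algorithm is commonly stated as the average, over the n intervals of a scheme, of max_r^2 / M_r, where max_r is the largest class count in interval r and M_r is the total number of values in that interval. The Python sketch below illustrates that baseline criterion only. It is not the ur-CAIM criterion proposed in the paper, which the abstract does not spell out. The function name caim_score, the half-open interval convention, and the toy data are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import Counter


def caim_score(values, labels, cut_points):
    """Score a discretization scheme with the original CAIM criterion.

    cut_points holds the interior boundaries only. Interval r is assumed
    to cover (cut[r-1], cut[r]], an illustrative convention.
    """
    cuts = sorted(cut_points)
    n_intervals = len(cuts) + 1
    # Quanta matrix: class frequencies for each interval.
    quanta = [Counter() for _ in range(n_intervals)]
    for value, label in zip(values, labels):
        quanta[bisect_left(cuts, value)][label] += 1

    total = 0.0
    for counts in quanta:
        size = sum(counts.values())
        if size:  # empty intervals contribute nothing
            total += max(counts.values()) ** 2 / size
    return total / n_intervals


# Toy data (hypothetical): one attribute, two perfectly separable classes.
values = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "b"]

print(caim_score(values, labels, [3.5]))  # cut between the classes
print(caim_score(values, labels, [2.5]))  # cut that mixes the classes
print(caim_score(values, labels, []))     # single interval, no cut
```

On this toy data the cut at 3.5 scores highest, because every interval it produces contains a single class. A scheme with purer intervals and fewer of them is what the criterion rewards.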