TY - JOUR
TI - Composite boolean separators for data analysis with applications in computed tomography and gene expression microarray data
DO - https://doi.org/doi:10.7282/T3MG7PXJ
PY - 2007
AB - An important topic in machine-learning / data-mining is that of analyzing binary datasets.
A binary dataset consists of a subset of n-vectors (observations) with binary components, each of which has an associated binary outcome (the class of the observation). Clearly, the set of n-vectors together with their outcomes represents a partially defined Boolean function.
The central problem of machine-learning / data-mining, the so-called classification problem, consists in finding an "extension" of the partially defined Boolean function closely approximating a hidden ("target") function. Various methods have been developed to solve this and related problems, such as identifying misclassified observations, revealing irrelevant and/or redundant variables, etc.
In this thesis, we propose a new approach to several problems in machine-learning / data-mining. First, we define a simple procedure for generating artificial Boolean variables, called Composite Boolean Features, and describe an iterative algorithm for generating Boolean functions which agree with the outcomes in a large proportion of the observations in the dataset. We call these functions Composite Boolean Separators (CBSes for short). We then use CBSes in several ways. In particular, we show how introducing CBSes can enhance the accuracy of classification systems; we employ CBSes to identify misclassified observations and examine how deleting such observations and reversing their class influence the classification accuracy; and we apply the new variables to the attribute selection problem, i.e., the problem of finding "good" (informative) subsets of the original attributes, or equivalently, identifying "bad" (irrelevant and/or redundant) attributes in the given datasets.
All the results have been tested on eight publicly available datasets and validated by five well-known machine-learning / data-mining techniques. Also, we applied CBSes, along with other techniques, to the analysis of two real-life medical datasets: computed tomography data and breast cancer gene expression microarray data.
The results presented in this thesis demonstrate that for many real-life datasets, the application of CBSes increases the classification accuracy significantly. CBSes also prove useful in the misclassification and attribute selection problems.
KW - Operations Research
KW - Tomography
KW - Data mining
KW - Gene expression
LA - English
ER -