This thesis concerns the development and mathematical analysis of statistical procedures for
classification problems. In supervised classification, the practitioner is presented with the
task of assigning an object to one of two or more classes, based on a number of labelled
observations from each class. With modern technological advances, vast amounts of data can
be collected routinely, which creates both new challenges and opportunities for statisticians.
After introducing the topic and reviewing the existing literature in Chapter 1, we investigate
two of the main issues that have arisen in recent times.
In Chapter 2 we introduce a very general method for high-dimensional classification,
based on a careful combination of the results of applying an arbitrary base classifier to random
projections of the feature vectors into a lower-dimensional space. In one special case that
we study in detail, the random projections are divided into non-overlapping blocks, and
within each block we select the projection yielding the smallest estimate of the test error.
Our random projection ensemble classifier then aggregates the results after applying the
chosen projections, with a data-driven voting threshold to determine the final assignment.
We derive bounds on the test error of a generic version of the ensemble as the number of
projections increases. Moreover, under a low-dimensional boundary assumption, we show that
the test error can be controlled by terms that do not depend on the original data dimension.
The classifier is compared empirically with several other popular classifiers via an extensive
simulation study, which reveals its excellent finite-sample performance.
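
To make the aggregation scheme concrete, the following is a minimal Python sketch of the block-selection step, assuming NumPy and scikit-learn are available. Its specifics are illustrative assumptions rather than the thesis's construction: Gaussian projections stand in for whatever projection distribution the thesis uses, linear discriminant analysis serves as the base classifier, the resubstitution error acts as a crude stand-in for the test-error estimate, and a fixed voting threshold of 1/2 replaces the data-driven threshold; the name rp_ensemble_predict and the block parameters B1 and B2 are likewise hypothetical.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def rp_ensemble_predict(X_train, y_train, X_test, d=5, B1=50, B2=10, seed=None):
    # Assumes binary labels coded 0/1.
    rng = np.random.default_rng(seed)
    p = X_train.shape[1]
    votes = np.zeros(len(X_test))
    for _ in range(B1):  # one selected projection (and one vote) per block
        best_err, best_pred = np.inf, None
        for _ in range(B2):  # candidate projections within the block
            A = rng.standard_normal((p, d)) / np.sqrt(d)  # random p -> d projection
            clf = LinearDiscriminantAnalysis().fit(X_train @ A, y_train)
            err = 1.0 - clf.score(X_train @ A, y_train)  # resubstitution error proxy
            if err < best_err:
                best_err, best_pred = err, clf.predict(X_test @ A)
        votes += best_pred
    # Majority vote over the B1 selected projections; the thesis instead
    # determines the voting threshold in a data-driven way.
    return (votes / B1 > 0.5).astype(int)

The point of the construction is that the base classifier is only ever fitted on d-dimensional projected data, which is what keeps the method practical when the original dimension p is large.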
Chapter 3 focuses on the k-nearest neighbour classifier. We first derive a new global
asymptotic expansion for its excess risk, which elucidates conditions under which the dominant
contribution to the risk comes from the locus of points at which each class label is
equally likely to occur, as well as situations where the dominant contribution comes from the
tails of the marginal distribution of the features. The results motivate an improvement to the
k-nearest neighbour classifier in semi-supervised settings. Our proposal allows k to depend
on an estimate of the marginal density of the features based on the unlabelled training data,
using fewer neighbours when the estimated density at the test point is small. We show that
the proposed semi-supervised classifier achieves a better balance in terms of the asymptotic
local bias-variance trade-off. We also demonstrate, via a simulation study, the finite-sample
improvement of the tail-adaptive classifier over the standard classifier.
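
The following is a minimal Python sketch of that semi-supervised idea, assuming NumPy, SciPy and scikit-learn. The kernel density estimator, the power-law rule mapping the estimated density to a local choice of k, the exponent gamma, and the names adaptive_knn_predict and k_max are all illustrative assumptions; the thesis derives its own form for how k should vary with the estimated density.

import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neighbors import KNeighborsClassifier

def adaptive_knn_predict(X_lab, y_lab, X_unlab, X_test, k_max=51, gamma=0.5):
    # Estimate the marginal feature density from the unlabelled data;
    # gaussian_kde expects data with shape (dimension, sample size).
    kde = gaussian_kde(X_unlab.T)
    dens = kde(X_test.T)
    # Use fewer neighbours where the estimated density is small; the
    # power-law form and the exponent gamma are illustrative assumptions.
    k_local = np.maximum(1, np.ceil(k_max * (dens / dens.max()) ** gamma)).astype(int)
    preds = np.empty(len(X_test), dtype=y_lab.dtype)
    for i, k in enumerate(k_local):
        clf = KNeighborsClassifier(n_neighbors=int(min(k, len(X_lab))))
        clf.fit(X_lab, y_lab)
        preds[i] = clf.predict(X_test[i:i + 1])[0]
    return preds

Refitting the classifier for every test point is done here only for clarity; an efficient implementation would build a single nearest-neighbour index over the labelled data and query it with a varying k.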