New approaches to modern statistical classification problems
This thesis concerns the development and mathematical analysis of statistical procedures for classification problems. In supervised classification, the practitioner is presented with the task of assigning an object to one of two or more classes, based on a number of labelled observations from each class. With modern technological advances, vast amounts of data can be collected routinely, which creates both new challenges and opportunities for statisticians. After introducing the topic and reviewing the existing literature in Chapter 1, we investigate two of the main issues to arise in recent times. In Chapter 2 we introduce a very general method for high-dimensional classification, based on careful combination of the results of applying an arbitrary base classifier to random projections of the feature vectors into a lower-dimensional space. In one special case that we study in detail, the random projections are divided into non-overlapping blocks, and within each block we select the projection yielding the smallest estimate of the test error. Our random projection ensemble classifier then aggregates the results of applying the base classifier to the chosen projections, with a data-driven voting threshold to determine the final assignment. We derive bounds on the test error of a generic version of the ensemble as the number of projections increases. Moreover, under a low-dimensional boundary assumption, we show that the test error can be controlled by terms that do not depend on the original data dimension. The classifier is compared empirically with several other popular classifiers via an extensive simulation study, which reveals its excellent finite-sample performance. Chapter 3 focuses on the k-nearest neighbour classifier.
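As a rough illustration of the block-wise selection and voting scheme described for Chapter 2, the following is a minimal sketch rather than the procedure analysed in the thesis. Several choices here are assumptions made only for illustration: the base classifier is a nearest-centroid rule, the projections have independent Gaussian entries, the test error is estimated by the training (resubstitution) error, and the voting threshold is picked from a fixed grid to minimise training error. All function names and the parameters `d`, `B1` and `B2` are hypothetical.

```python
import random

def project(A, x):
    # A is a d x p matrix (list of rows); returns the d-dimensional vector A x.
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def fit_centroids(Z, y):
    # Assumed base classifier: class centroids in the projected space.
    cents = {}
    for c in (0, 1):
        pts = [z for z, lab in zip(Z, y) if lab == c]
        cents[c] = [sum(col) / len(pts) for col in zip(*pts)]
    return cents

def predict_centroid(cents, z):
    # Assign z to the class with the closer centroid (ties go to class 0).
    dist = {c: sum((a - b) ** 2 for a, b in zip(z, m)) for c, m in cents.items()}
    return 1 if dist[1] < dist[0] else 0

def rp_ensemble_fit(X, y, d=2, B1=20, B2=10, rng=None):
    # B1 non-overlapping blocks of B2 random projections each; within each
    # block keep the projection with the smallest estimated error.
    rng = rng or random.Random(0)
    p = len(X[0])
    chosen = []
    for _ in range(B1):
        best = None
        for _ in range(B2):
            A = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(d)]
            Z = [project(A, x) for x in X]
            cents = fit_centroids(Z, y)
            # Assumed error estimate: resubstitution error on the training data.
            err = sum(predict_centroid(cents, z) != lab for z, lab in zip(Z, y)) / len(y)
            if best is None or err < best[0]:
                best = (err, A, cents)
        chosen.append((best[1], best[2]))
    return chosen

def vote_fraction(chosen, x):
    # Proportion of the selected projections that vote for class 1.
    return sum(predict_centroid(c, project(A, x)) for A, c in chosen) / len(chosen)

def choose_alpha(chosen, X, y):
    # Assumed data-driven threshold: grid value minimising training error.
    fracs = [vote_fraction(chosen, x) for x in X]
    grid = [i / 20 for i in range(1, 20)]
    return min(grid, key=lambda a: sum((f >= a) != bool(lab) for f, lab in zip(fracs, y)))

def rp_ensemble_predict(chosen, alpha, x):
    # Final assignment: class 1 if the vote fraction reaches the threshold.
    return 1 if vote_fraction(chosen, x) >= alpha else 0
```

In this sketch the dimension `p` of the data enters only through the cost of projecting; every base classifier is fitted in the `d`-dimensional projected space.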
We first derive a new global asymptotic expansion for the excess risk of this classifier, which elucidates conditions under which the dominant contribution to the risk comes from the locus of points at which each class label is equally likely to occur, as well as situations where the dominant contribution comes from the tails of the marginal distribution of the features. The results motivate an improvement to the k-nearest neighbour classifier in semi-supervised settings. Our proposal allows k to depend on an estimate of the marginal density of the features based on the unlabelled training data, using fewer neighbours when the estimated density at the test point is small. We show that the proposed semi-supervised classifier achieves a better balance in terms of the asymptotic local bias-variance trade-off. We also demonstrate, via a simulation study, the improvement in finite-sample performance of this tail-adaptive classifier over the standard k-nearest neighbour classifier.
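The idea of letting k depend on an estimated feature density can be sketched as follows. This is an illustration under stated assumptions, not the classifier studied in the thesis: the density estimate is a Gaussian kernel estimator with a fixed bandwidth `h`, and the rule in `local_k`, including its exponent 4/(d+4), the constant `B` and the cap at n/2, is an assumed form chosen only so that k shrinks where the estimated density is small. All function names are hypothetical.

```python
import math

def kde(unlabelled, x, h):
    # Gaussian kernel density estimate at x, built from the unlabelled features.
    d = len(x)
    norm = len(unlabelled) * (h * math.sqrt(2 * math.pi)) ** d
    total = sum(
        math.exp(-sum((a - b) ** 2 for a, b in zip(u, x)) / (2 * h * h))
        for u in unlabelled
    )
    return total / norm

def local_k(density, n, d, B=1.0):
    # Assumed rule: k grows with n * density, so low-density (tail) regions
    # use fewer neighbours; k is kept between 1 and n // 2.
    k = int(B * (n * density) ** (4 / (d + 4)))
    return max(1, min(k, n // 2))

def tail_adaptive_knn(X, y, unlabelled, x, h=0.5, B=1.0):
    # Classify x by majority vote among its k(x) nearest labelled neighbours,
    # where k(x) depends on the estimated density at x (ties go to class 1).
    n, d = len(X), len(x)
    k = local_k(kde(unlabelled, x, h), n, d, B)
    order = sorted(range(n), key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)))
    votes = sum(y[i] for i in order[:k])
    return 1 if 2 * votes >= k else 0
```

The labels `y` are assumed to be coded as 0 and 1, and `unlabelled` may include the labelled feature vectors as well as any additional unlabelled ones.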