Essays on Empirical likelihood for Heaviness Estimation, Outlier Detection and Clustering

Date

2024-04-24

Authors

Zhang, Zhuojing

Advisor

Chen, Tao

Publisher

University of Waterloo

Abstract

Empirical likelihood (EL) is a non-parametric likelihood method of inference, and a large body of work covers its extensions and applications. Most studies discuss the EL ratio for constructing confidence regions and testing hypotheses, whereas this thesis focuses on the EL weight assigned to each observation in the dataset by the EL ratio function. The thesis contains three chapters on the behaviour and application of EL weights. Chapter 1 provides a novel approach, based on the EL weights, to estimate a threshold that separates the bulk and the tail of datasets with a heavy-tailed histogram. Because the bulk and the tail cannot be cleanly separated in many cases, we allow the threshold to be a random variable instead of a fixed number. In addition, the threshold is defined relative to a benchmark, since heaviness is a relative concept. In Chapter 2, we focus on outlier detection and develop an unsupervised method based on EL to identify outliers. In particular, we calculate the EL weights through the EL ratio function with the bootstrap mean constraint and show that the EL weights behave differently for datasets with and without outliers. Additionally, the EL weights provide a measure of outlierness for all observations, which might reduce the time cost. In Chapter 3, we consider a clustering algorithm based on the EL weights. Clustering is an unsupervised method that aims to group unlabeled data based on their similarities. Numerous clustering methods have been proposed, and their performance typically depends on the characteristics of the dataset in the specific application. The proposed EL-weight-based clustering algorithm can handle datasets with outliers. Moreover, it might suggest the number of clusters when the clusters are well separated.
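
For readers unfamiliar with EL weights, the following is a minimal sketch of the textbook mean-constraint formulation, in which the weights are w_i = 1 / (n (1 + lambda (x_i - mu))) and lambda solves sum_i (x_i - mu) / (1 + lambda (x_i - mu)) = 0. It is only an illustration: the function name `el_weights`, the bisection solver, and the toy data are assumptions, and this is not the bootstrap mean constraint or any other procedure specific to the thesis.

```python
def el_weights(x, mu, iters=200):
    """Illustrative EL weights under the constraint sum_i w_i * (x_i - mu) = 0.

    Hypothetical helper for illustration only; not code from the thesis.
    """
    z = [xi - mu for xi in x]
    n = len(z)
    z_min, z_max = min(z), max(z)
    if z_min >= 0 or z_max <= 0:
        raise ValueError("mu must lie strictly inside the range of x")

    # lambda must keep every 1 + lambda * z_i positive, which bounds it to
    # the open interval (-1 / z_max, -1 / z_min); shrink slightly to stay inside.
    lo = (-1.0 / z_max) * (1.0 - 1e-9)
    hi = (-1.0 / z_min) * (1.0 - 1e-9)

    def score(lam):
        return sum(zi / (1.0 + lam * zi) for zi in z)

    # score is strictly decreasing in lambda, so plain bisection finds its root.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [1.0 / (n * (1.0 + lam * zi)) for zi in z]


# Toy data (made up): one observation sits far from the rest.
x = [1.0, 2.0, 3.0, 10.0]
w = el_weights(x, mu=3.0)
```

In this toy case the hypothesised mean (3.0) is below the sample mean (4.0), so the observation at 10.0 receives a weight below 1/n while the weights still sum to one. Per-observation weights of this kind are the quantity the abstract refers to.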
