|dc.description.abstract||Inherent in any multivariate data is structure, which describes the general shape and distribution of the underlying point configuration. While many types of structure could be of interest, attention here is restricted to two general types: geometric structure, the general shape of a point configuration, and probabilistic structure, the general distribution of points within the configuration.
The ability to quantify geometric structure is an important step in many common statistical analyses. For instance, general neighbourhood structure is captured using a k-nearest neighbour graph in dimension reduction techniques such as Isomap and locally linear embedding. Neighbourhood graphs are also used in sensor network localization, which has applications in fields such as environmental habitat monitoring and wildlife monitoring. Another geometric graph, the convex hull, is also used in wildlife monitoring as a rough estimate of an animal's home range.
The identification of areas of high and low density is one example of measuring the probabilistic structure of a configuration, which can be done using a wide variety of methods. One such method is kernel density estimation, in which the density estimate at a location can be viewed as a weighted sum of contributions from nearby points. Kernel density estimation has widely varying applications, including regression analysis, and is used in general to assess certain features of the data (modality, skewness, etc.).
Related to the idea of measuring structure is the concept of "Cognostics", which has been formalized as scatterplot diagnostics (or scagnostics). Scagnostics provides a framework through which interesting structure can be measured in a configuration. The central idea is to numerically summarize the structure of a large number of two-dimensional point configurations via measures calculated on geometric graphs. This allows the interesting views to be quickly identified, and ultimately examined visually, while the views deemed to be uninteresting are simply discarded. While the current framework is a good starting point, several of its issues need to be addressed. For instance, while each measure is designed to take values in [0,1], some measures fail to achieve this range even when measured over tens of thousands of configurations. In addition, much structure that could be considered interesting is not captured by the current framework. These issues, among others, will be addressed and rectified so that the current scagnostic framework can continue to be built upon.
With tools to measure structure, attention is turned to making use of the structural information contained in the configuration. Consider the problem of preserving measured structure during the task of data aggregation, more commonly known as binning. Existing methods of data aggregation tend to fall at the two ends of the structure retention spectrum. Experimentation will show that methods such as equal width and hexagonal binning tend to retain the shape of the configuration at the expense of the density, while methods such as equal frequency binning and random sampling tend to retain relative density at the expense of overall shape. Tree-based binning, a general binning framework inspired by classification and regression trees, is proposed to bridge the gap between these sets of specialist algorithms. GapBin, a specially designed tree-based binning algorithm, will be shown through experimentation to provide a trade-off in low-dimensional space between geometric structure retention and probabilistic structure retention. In higher dimensions, it will be shown to be the superior algorithm in terms of structure retention among those considered.
Next, the general problem of constructing a configuration with a given underlying structure is considered. For example, the minimal spanning tree is known to carry important clustering information. Of interest, then, is the generation of configurations with a given minimal spanning tree structure. The problem of generating a configuration with a known minimal spanning tree is equivalent to completing a Euclidean distance matrix in which the only known entries are those in the minimal spanning tree. For this problem, there are several solutions, including those of Alfakih et al., Fang & O'Leary, and Trosset. None of these algorithms, however, is designed to retain the structure of the minimal spanning tree. In addition, the sparsity of a Euclidean distance matrix containing only the minimal spanning tree results in completions that are inaccurate compared with the known completion. This leads to issues in the point configurations of the resulting completions. To resolve these issues, two new algorithms are proposed which are designed to retain the structure of the minimal spanning tree, leading to more accurate completions of these sparse matrices.
To complement the algorithms presented, their implementation in the statistical programming language R will also be discussed. In particular, the R packages treebinr, for tree-based binning, and edmcr, for Euclidean distance matrix completions, will be presented.||en