Pattern Discovery and Disentanglement for Clinical Data Analysis

Zhou, Pei-Yuan

Pattern Discovery and Disentanglement for Clinical Data Analysis

Files

Zhou_Peiyuan.pdf (3.87 MB)

Date

2020-09-10

Authors

Zhou, Pei-Yuan

Advisor

Wong, Andrew

Publisher

University of Waterloo

Abstract

In recent years, machine learning approaches have important empirical successes on analysing data such as images, signals, texts and speeches with applications in biomedical and clinical areas. However, from the perspective of modelling, many machine learning methods still encounter crucial problems such as the lack of transparency and interpretability. Frequent Pattern Mining or Association Mining methods intend to solve the problem of interpretability, but they also encounter serious problems such as requiring exhaustive search and producing overwhelming numbers of patterns. From the perspective of data analysis, they do not render high prediction accuracy particularly for data with low volume, rare or imbalanced groups, rare cases or biases due to subtle overlapping or entanglement of the statistical and functional associations at the data source level. Hence, Professor Andrew K.C. Wong and I have developed a novel Pattern Discovery and Disentanglement (PDD) Method to discover explicit patterns and unveil knowledge from relational datasets even encompassing imbalanced groups, biases and anomalies. The statistically significant high-order patterns, pattern clusters and rare patterns are discovered in the disentangled Attribute Value Association (AVA) Spaces. They may be embedded in a relational dataset but overlapping or entangled with each other so that they are masked or obscured at the data level. The patterns discovered from the disentangled association source can be used for explicitly interpreting the original data, predicting the functional groups/classes and detecting anomalies and/or outliers. When class labels are not given, pattern/entity clustering can be more effectively discovered from the disentangled attribute value association (AVA) space than from the original records. The objective of this Master Thesis is to develop and validate the efficacy of PDD for genomic and clinical data analysis using a) protein sequence data, b) public clinical records from UCI dataset and c) a clinical dataset obtained from the School of Public Health and Health Systems at the University of Waterloo. The experimental results with superior performance in unsupervised and supervised learning than existing methods are presented in interpretable knowledge representation frameworks, interlinking the AVA disentangled sources, patterns, pattern/entity clusters and individual entities. In the clinical cases, it reveals the symptomatic patterns of individual patients, disease complexes/groups and subtle etiological sources. Hence it will have impacts in machine learning on genomic and clinical data with broad applications.