Greedy Representative Selection for Unsupervised Data Analysis

dc.comment.hiddenThe Permissions section of the thesis includes the required permissions.en
dc.contributor.authorHelwa, Ahmed Khairy Farahat
dc.date.accessioned2013-01-25T15:07:13Z
dc.date.available2013-01-25T15:07:13Z
dc.date.issued2013-01-25T15:07:13Z
dc.date.submitted2012
dc.description.abstractIn recent years, the advance of information and communication technologies has allowed the storage and transfer of massive amounts of data. The availability of this overwhelming amount of data stimulates a growing need to develop fast and accurate algorithms to discover useful information hidden in the data. This need is even more acute for unsupervised data, which lacks information about the categories of different instances. This dissertation addresses a crucial problem in unsupervised data analysis, which is the selection of representative instances and/or features from the data. This problem can be generally defined as the selection of the most representative columns of a data matrix, which is formally known as the Column Subset Selection (CSS) problem. Algorithms for column subset selection can be directly used for data analysis or as a pre-processing step to enhance other data mining algorithms, such as clustering. The contributions of this dissertation can be summarized as outlined below. First, a fast and accurate algorithm is proposed to greedily select a subset of columns of a data matrix such that the reconstruction error of the matrix based on the subset of selected columns is minimized. The algorithm is based on a novel recursive formula for calculating the reconstruction error, which allows the development of time and memory-efficient algorithms for greedy column subset selection. Experiments on real data sets demonstrate the effectiveness and efficiency of the proposed algorithms in comparison to the state-of-the-art methods for column subset selection. Second, a kernel-based algorithm is presented for column subset selection. The algorithm greedily selects representative columns using information about their pairwise similarities. The algorithm can also calculate a Nyström approximation for a large kernel matrix based on the subset of selected columns. In comparison to different Nyström methods, the greedy Nyström method has been empirically shown to achieve significant improvements in approximating kernel matrices, with minimum overhead in run time. Third, two algorithms are proposed for fast approximate k-means and spectral clustering. These algorithms employ the greedy column subset selection method to embed all data points in the subspace of a few representative points, where the clustering is performed. The approximate algorithms run much faster than their exact counterparts while achieving comparable clustering performance. Fourth, a fast and accurate greedy algorithm for unsupervised feature selection is proposed. The algorithm is an application of the greedy column subset selection method presented in this dissertation. Similarly, the features are greedily selected such that the reconstruction error of the data matrix is minimized. Experiments on benchmark data sets show that the greedy algorithm outperforms state-of-the-art methods for unsupervised feature selection in the clustering task. Finally, the dissertation studies the connection between the column subset selection problem and other related problems in statistical data analysis, and it presents a unified framework which allows the use of the greedy algorithms presented in this dissertation to solve different related problems.en
dc.identifier.urihttp://hdl.handle.net/10012/7270
dc.language.isoenen
dc.pendingfalseen
dc.publisherUniversity of Waterlooen
dc.subjectData Miningen
dc.subjectMachine Learningen
dc.subjectUnsupervised Data Analysisen
dc.subjectGreedy Algorithmsen
dc.subjectRepresentative Selectionen
dc.subjectFeature Selectionen
dc.subjectData Clusteringen
dc.subject.programElectrical and Computer Engineeringen
dc.titleGreedy Representative Selection for Unsupervised Data Analysisen
dc.typeDoctoral Thesisen
uws-etd.degreeDoctor of Philosophyen
uws-etd.degree.departmentElectrical and Computer Engineeringen
uws.peerReviewStatusUnrevieweden
uws.scholarLevelGraduateen
uws.typeOfResourceTexten

Files

Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Helwa_Ahmed_Khairy_Farahat.pdf
Size:
1.42 MB
Format:
Adobe Portable Document Format
License bundle
Now showing 1 - 1 of 1
No Thumbnail Available
Name:
license.txt
Size:
248 B
Format:
Item-specific license agreed upon to submission
Description: