Show simple item record

dc.contributor.authorElgohary, Ahmed
dc.date.accessioned2014-02-18 21:01:59 (GMT)
dc.date.available2014-02-18 21:01:59 (GMT)
dc.date.issued2014-02-18
dc.date.submitted2014-02-14
dc.identifier.urihttp://hdl.handle.net/10012/8262
dc.description.abstractThere is an increasing demand from businesses and industries to make the best use of their data. Clustering is a powerful tool for discovering natural groupings in data. The k-means algorithm is the most commonly-used data clustering method, having gained popularity for its effectiveness on various data sets and ease of implementation on different computing architectures. It assumes, however, that data are available in an attribute-value format, and that each data instance can be represented as a vector in a feature space where the algorithm can be applied. These assumptions are impractical for real data, and they hinder the use of complex data structures in real-world clustering applications. The kernel k-means is an effective method for data clustering which extends the k-means algorithm to work on a similarity matrix over complex data structures. The kernel k-means algorithm is however computationally very complex as it requires the complete data matrix to be calculated and stored. Further, the kernelized nature of the kernel k-means algorithm hinders the parallelization of its computations on modern infrastructures for distributed computing. This thesis defines a family of kernel-based low-dimensional embeddings that allows for scaling kernel k-means on MapReduce via an efficient and unified parallelization strategy. Then, three practical methods for low-dimensional embedding that adhere to our definition of the embedding family are proposed. Combining the proposed parallelization strategy with any of the three embedding methods constitutes a complete scalable and efficient MapReduce algorithm for kernel k-means. The efficiency and the scalability of the presented algorithms are demonstrated analytically and empirically.en
dc.language.isoenen
dc.publisherUniversity of Waterlooen
dc.subjectData Clusteringen
dc.subjectKernel Methodsen
dc.subjectScalable Data Analyticsen
dc.subjectMapReduceen
dc.subjectBig Dataen
dc.titleScalable Embeddings for Kernel Clustering on MapReduceen
dc.typeMaster Thesisen
dc.pendingfalse
dc.subject.programElectrical and Computer Engineeringen
uws-etd.degree.departmentElectrical and Computer Engineeringen
uws-etd.degreeMaster of Applied Scienceen
uws.typeOfResourceTexten
uws.peerReviewStatusUnrevieweden
uws.scholarLevelGraduateen


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record


UWSpace

University of Waterloo Library
200 University Avenue West
Waterloo, Ontario, Canada N2L 3G1
519 888 4883

All items in UWSpace are protected by copyright, with all rights reserved.

DSpace software

Service outages