Exemplar-based Kernel Preserving Embedding

dc.contributor.advisor: Karray, Fakhri
dc.contributor.author: Elbagoury, Ahmed
dc.date.accessioned: 2016-05-02T16:15:56Z
dc.date.available: 2016-05-02T16:15:56Z
dc.date.issued: 2016-05-02
dc.date.submitted: 2016-04-28
dc.description.abstract: With the rapid increase of available data, it becomes computationally harder to extract useful information, especially in the case of high-dimensional data. Choosing a representative subset of the data can help overcome this challenge, as these representatives can be used by data analysts or presented to end users to give them a grasp of the nature and structure of the data. In this dissertation, an exemplar-based approach for topic detection is first proposed, in which detected topics are represented by a few selected tweets. Using exemplar tweets instead of a set of keywords allows for an easy interpretation of the meaning of the detected topics. The approach is then extended to detect topics that emerge in new epochs of data. Experimental evaluation on benchmark Twitter datasets shows that the proposed topic detection approach achieves the best term precision while maintaining good topic recall and running times compared to other topic detection approaches. Moreover, the proposed emerging-topic extension achieves higher topic recall with improved running times compared to recent emerging topic detection approaches. To overcome the challenge of high-dimensional data, several techniques, such as PCA and NMF, have been proposed to embed high-dimensional data into a low-dimensional latent space. However, data represented in a latent space is difficult for data analysts to interpret, as the information encoded in the latent features is hard to grasp. In addition, these techniques do not take the relations between the data points into account. This motivated the development of techniques such as MDS, LLE, and ISOMAP, which preserve the relations between data instances but still rely on latent features. In this dissertation, a new embedding technique is proposed that mitigates these problems by projecting the data into a space described by a few selected points (the exemplars) while preserving the relations between the data points. The proposed method, Exemplar-based Kernel Preserving (EBEK) embedding, is shown theoretically to achieve the lowest reconstruction error of the kernel matrix, and its running time is linear in the number of samples. Using EBEK in the approximate nearest neighbor search task shows its ability to outperform related work by up to 60% in recall while maintaining a good running time. In addition, empirical evaluation on clustering shows that EBEK achieves higher NMI than LLE and NMF, by up to 40% and 15% respectively. It also achieves cluster quality comparable to ISOMAP, within 3% in NMI and F-measure, with a speedup of up to 15×. Finally, interpretability experiments show that EBEK's selected basis is more understandable than latent bases on image datasets.
dc.identifier.uri: http://hdl.handle.net/10012/10435
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.subject: Exemplar
dc.subject: Embedding
dc.subject: Kernel Methods
dc.subject: Topic Detection
dc.title: Exemplar-based Kernel Preserving Embedding
dc.type: Master Thesis
uws-etd.degree: Master of Applied Science
uws-etd.degree.department: Electrical and Computer Engineering
uws-etd.degree.discipline: Electrical and Computer Engineering
uws-etd.degree.grantor: University of Waterloo
uws.contributor.advisor: Karray, Fakhri
uws.contributor.affiliation1: Faculty of Engineering
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text
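
As an illustration of the embedding idea summarized in the abstract, the sketch below represents each data point by its kernel similarities to a small set of exemplar points, from which the full kernel matrix can be approximately reconstructed. This is only a minimal sketch under assumed choices: the RBF kernel, the random exemplar selection, and the names rbf_kernel, exemplar_embedding, gamma, and n_exemplars are illustrative assumptions and do not reproduce EBEK's actual exemplar-selection criterion from the thesis.

import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # Pairwise RBF kernel values between the rows of X and the rows of Y.
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-gamma * sq)

def exemplar_embedding(X, n_exemplars=10, gamma=0.5, seed=0):
    # Pick exemplars by random sampling -- an illustrative stand-in for an
    # optimized selection that minimizes the kernel reconstruction error.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_exemplars, replace=False)
    exemplars = X[idx]
    # Each embedding coordinate is a similarity to an actual data point
    # (an exemplar), so the representation stays interpretable.
    Z = rbf_kernel(X, exemplars, gamma)          # (n_samples, n_exemplars); linear in n_samples
    # Nystrom-style reconstruction of the full kernel matrix from the exemplars,
    # shown only to make the "kernel preserving" idea concrete.
    W = rbf_kernel(exemplars, exemplars, gamma)  # (n_exemplars, n_exemplars)
    K_approx = Z @ np.linalg.pinv(W) @ Z.T
    return Z, K_approx

# Usage: embed 1000 random 50-dimensional points with 10 exemplars.
X = np.random.default_rng(1).normal(size=(1000, 50))
Z, K_approx = exemplar_embedding(X, n_exemplars=10)

Computing the embedding Z touches each sample once against the exemplars, which is consistent with the linear running time mentioned in the abstract; materializing K_approx is quadratic and is included here only to illustrate the reconstruction.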

Files

Original bundle

Name: Elbagoury_Ahmed.pdf
Size: 1.43 MB
Format: Adobe Portable Document Format
Description: Ahmed Elbagoury - Master's thesis

License bundle

Name: license.txt
Size: 6.17 KB
Format: Item-specific license agreed upon to submission