Show simple item record

dc.contributor.authorIbrahim, Rania
dc.date.accessioned2016-04-29 16:01:13 (GMT)
dc.date.available2016-04-29 16:01:13 (GMT)
dc.date.issued2016-04-29
dc.date.submitted2016-04-29
dc.identifier.urihttp://hdl.handle.net/10012/10424
dc.description.abstractReal time topic detection in Twitter streams is an important task that helps discovering natural disasters in a real time from users’ posts and helps political parties and companies understand users’ opinions and needs. In 2014 the number of active users on Twitter is reported to be more than 288 million users who are posting around 500 million tweets daily. Therefore, detecting topics from Twitter streams in a real time becomes a challenging task that needs scalable and efficient techniques to handle this large amount of data. In this work, we scale an Exemplar-based technique that detects topics from Twitter streams, where each of the detected topics is represented by one tweet (i.e, exemplar). Using exemplar tweets to represent the detected topics, makes these topics easier to interpret as opposed to representing them by uncorrelated terms as in other topic detection algorithms. The approach is implemented using Apache Giraph and is being extended here to efficiently support sliding windows. Experimental results on four datasets show that the optimized Giraph implementation achieves a speedup of up to nineteen times over the native implementation, while maintaining good quality of the detected topics. In addition, Giraph Exemplar-based approach achieves the best topic recall and term precision against K-means, Latent Dirichlet Allocation (LDA), Non-negative matrix factorization (NMF) and Latent Semantic Analysis (LSA), while maintaining a good term recall and running time. The approach is also deployed for detecting topics from real-time Twitter streams and its scalability is demonstrated. Moreover, another clustering technique called Local Variance-based Clustering (LVC) is proposed in this thesis for detecting topics from Twitter streams. Local Variance-based Clustering (LVC) defines the data points densities based on their similarities. The proposed local variance measure is calculated based on the variance of the data points similarity histogram and is shown to well distinguish between core, border, connecting and outliers points. Experimental results show that LVC outperforms spectral clustering and affinity propagation in clustering quality using control charts, Ecoli and images datasets, while maintaining a good running time. In addition, results show that LVC can detect topics from Twitter with higher topic recall by 15% and higher term precision by 3% over DBSCAN.en
dc.language.isoenen
dc.publisherUniversity of Waterlooen
dc.subjectTwitter Streamsen
dc.subjectClusteringen
dc.subjectScalable Topic Detectionen
dc.titleScalable Topic Detection Approaches fromTwitter Streamsen
dc.typeMaster Thesisen
dc.pendingfalse
uws-etd.degree.departmentElectrical and Computer Engineeringen
uws-etd.degree.disciplineElectrical and Computer Engineeringen
uws-etd.degree.grantorUniversity of Waterlooen
uws-etd.degreeMaster of Scienceen
uws.contributor.advisorKarray, Fakhri
uws.contributor.affiliation1Faculty of Engineeringen
uws.published.cityWaterlooen
uws.published.countryCanadaen
uws.published.provinceOntarioen
uws.typeOfResourceTexten
uws.peerReviewStatusUnrevieweden
uws.scholarLevelGraduateen


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record


UWSpace

University of Waterloo Library
200 University Avenue West
Waterloo, Ontario, Canada N2L 3G1
519 888 4883

All items in UWSpace are protected by copyright, with all rights reserved.

DSpace software

Service outages