Nguyen, Olivier
2018-08-17
2018-08-17
2018-08-17
2018-08-15
http://hdl.handle.net/10012/13603
Social media platforms contain large amounts of freely and publicly available data that could be used to measure population characteristics across different geographical regions. Analyzing public data sources such as social media data has shown promising results for public health measures and monitoring. This thesis addresses challenges in building systems that collect high volumes of data from social media platforms. More specifically, we look at Twitter data processing, filtering, and aggregation to provide population-level indicators of physical activity, sedentary behavior, and sleep (PASS). In the first part of the thesis, we go over the full machine learning pipeline we built: (i) Twitter data collection from November 2017 to May 2018; (ii) data preparation through manual annotation, keyword filtering, and an active learning technique for the labelling of 10,283 tweets; and (iii) training a classifier to identify PASS-related tweets. Training the model involves building an initial classifier to efficiently find relevant tweets in subsequent annotation iterations. Our classifiers include an ensemble model consisting of several shallow machine learning algorithms, along with deep learning algorithms. In the second part of the thesis, we look at the performance of different solutions. We provide benchmark results for the task of classifying PASS-related tweets for the various algorithms considered. We also derive health indicators by aggregating and computing the proportion of classified tweets by province and compare our metrics with the prevalence of obesity, diabetes and mood disorders from the Canadian Community Health Survey. Our work shows how machine learning can be used to complement public health data and better inform health policy makers to improve the lives of Canadians.
en
Population-level Indicators of Physical Activity, Sedentary Behaviour and Sleep in Canada based on Twitter
Master Thesis
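The aggregation step mentioned in the abstract (the proportion of classified tweets by province) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `pass_proportion_by_province`, the `(province, is_pass)` input format, and the sample values are all hypothetical, and the sketch assumes each tweet has already been assigned a province and a binary PASS label by some classifier.

```python
from collections import Counter


def pass_proportion_by_province(tweets):
    """Return {province: share of tweets labelled PASS-related}.

    `tweets` is an iterable of (province, is_pass) pairs. Both the input
    format and the labels are assumptions made for illustration only.
    """
    totals = Counter()     # all tweets seen per province
    positives = Counter()  # tweets labelled PASS-related per province
    for province, is_pass in tweets:
        totals[province] += 1
        if is_pass:
            positives[province] += 1
    # Every province in `totals` has at least one tweet, so no division by zero.
    return {p: positives[p] / totals[p] for p in totals}


# Made-up sample data, not results from the thesis.
sample = [("ON", True), ("ON", False), ("QC", True), ("QC", True), ("BC", False)]
proportions = pass_proportion_by_province(sample)
```

A per-province proportion of this kind is what could then be set against survey prevalence figures, province by province.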