Show simple item record

dc.contributor.author: Pham, Viet Hung
dc.date.accessioned: 2022-09-07 18:49:50 (GMT)
dc.date.available: 2022-09-07 18:49:50 (GMT)
dc.date.issued: 2022-09-07
dc.date.submitted: 2022-09-01
dc.identifier.uri: http://hdl.handle.net/10012/18717
dc.description.abstract: For the last decade, deep learning (DL) has emerged as an effective machine learning approach capable of solving difficult challenges. Owing to their increasing effectiveness, DL approaches have been applied widely in commercial products such as social media platforms and self-driving cars. Such widespread application in critical areas means that bugs in DL systems can lead to serious consequences. Our research focuses on improving the reliability of such DL systems. At a high level, DL system development starts with labeled data, which is used to train a DL model with some training method; once trained, the model produces predictions for unlabeled data in the inference stage. In this thesis, we present testing and analysis techniques that improve DL system reliability at every stage.

In the first work, CRADLE, we improve the reliability of DL system inference by applying differential testing to find bugs in DL libraries. One key challenge in testing DL libraries is knowing the expected output of a library for a given input; we overcome this challenge by leveraging equivalent DL libraries. CRADLE finds and localizes bugs in DL software libraries by performing cross-implementation inconsistency checking to detect bugs, and by tracking and analyzing anomaly propagation to localize the faulty functions that cause them. CRADLE detects 12 bugs in three libraries (TensorFlow, CNTK, and Theano), and highlights functions relevant to the causes of inconsistencies for all 104 unique inconsistencies.

Our second work is the first to study the variance of DL system training and the awareness of this variance among researchers and practitioners. Our experiments show large overall accuracy differences among identical training runs: even after excluding weak models, the accuracy difference is 10.8%. In addition, implementation-level factors alone cause accuracy differences of up to 2.9% across identical training runs. Our survey of 901 researchers and practitioners shows that 83.8% of participants are unaware of or unsure about any implementation-level variance. This work raises awareness of DL training variance and directs SE researchers to challenging tasks such as creating deterministic DL implementations to facilitate debugging and to improve the reproducibility of DL software and results.

DL systems perform well on static test sets drawn from the same distribution as their training sets, but may not be robust in real-world deployments because of the fundamental assumption that the training data represents the real-world data well. When the training data misses samples from the real-world distribution, it is said to contain blindspots. In practice, a training dataset is more likely to contain weakspots, a weaker form of blindspots in which the training data contains some samples that represent the real world, but not enough of them. In the third work, we propose a new procedure to detect weakspots in training data and to improve the DL system with minimal labeling effort. This procedure leverages the variance of the DL training process to detect highly varying data samples that could indicate weakspots. Metrics that measure such variance can also be used to rank new samples, prioritizing the labeling of additional training data that can improve the DL system's accuracy when applied to the real world. Our evaluation shows that, in scenarios where the weakspots are severe, our procedure improves model accuracy on weakspot samples by 25.2% while requiring only 2% of additional training data. This is an improvement of 4.5 percentage points over the traditional single-model metric with the same amount of additional training data.
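The cross-implementation inconsistency checking described in the abstract can be illustrated with a minimal, hypothetical sketch. The two "backends" below are stand-ins for equivalent DL libraries (CRADLE compares full model predictions across TensorFlow, CNTK, and Theano); the function names, the toy softmax computation, and the tolerance are illustrative assumptions, not CRADLE's actual implementation.

```python
# Sketch of cross-implementation inconsistency checking: the same
# input is fed to two equivalent implementations, and the outputs
# serve as each other's test oracle. A large deviation, or a crash
# in only one backend, flags a potential bug.
import math

def softmax_naive(xs):
    """Backend A: textbook softmax, numerically unstable."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(xs):
    """Backend B: max-shifted softmax, numerically stable."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def check_inconsistency(xs):
    """Differential test: run both backends on the same input and
    return the largest element-wise deviation between their outputs.
    A crash in one backend only is reported as infinite deviation."""
    try:
        a = softmax_naive(xs)
    except OverflowError:
        return float("inf")  # backend A crashed; backend B may still work
    b = softmax_stable(xs)
    return max(abs(p - q) for p, q in zip(a, b))

# Small inputs: both backends agree to within floating-point noise.
print(check_inconsistency([1.0, 2.0, 3.0]))
# Large logits: the naive backend overflows, exposing an inconsistency.
print(check_inconsistency([1000.0, 1001.0]))  # inf
```

In this toy setting the inconsistency pinpoints the faulty function directly; CRADLE additionally tracks how anomalies propagate through a model's layers to localize the faulty function inside a full library.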
dc.language.iso: en
dc.publisher: University of Waterloo
dc.subject: deep learning
dc.subject: software testing
dc.subject: deep learning system testing
dc.title: Improving the Reliability of Deep Learning Software Systems
dc.type: Doctoral Thesis
dc.pending: false
uws-etd.degree.department: David R. Cheriton School of Computer Science
uws-etd.degree.discipline: Computer Science
uws-etd.degree.grantor: University of Waterloo
uws-etd.degree: Doctor of Philosophy
uws-etd.embargo.terms: 0
uws.contributor.advisor: Yu, Yaoliang
uws.contributor.advisor: Tan, Lin
uws.contributor.affiliation1: Faculty of Mathematics
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.typeOfResource: Text
uws.peerReviewStatus: Unreviewed
uws.scholarLevel: Graduate





All items in UWSpace are protected by copyright, with all rights reserved.
