dc.contributor.author: MENG, XIAO
dc.date.accessioned: 2019-09-19 19:39:23 (GMT)
dc.date.available: 2019-09-19 19:39:23 (GMT)
dc.date.issued: 2019-09-19
dc.date.submitted: 2019-09-16
dc.identifier.uri: http://hdl.handle.net/10012/15084
dc.description.abstract: In many modern data management scenarios, we encounter tasks, operations, or computational phases that are data-intensive: the sheer volume of data is overwhelming to handle and becomes a performance bottleneck. For data-intensive tasks, the bottleneck is data loading, where the cost of loading data into memory is more significant than the cost of the actual computation. For data-intensive shuffling, the bottleneck is data transfer, where intermediate data are scattered and shuffled for further processing. This thesis addresses two data-intensive scheduling problems: (1) multi-processor scheduling for data-intensive tasks to reduce redundant data loading; (2) reducer scheduling for data-intensive shuffling to reduce redundant data communication. For data-intensive tasks, we focus on workloads with precedence constraints arising from data dependencies, which are common in applications such as data analytics and ETL processing. These workloads are often known in advance, are represented as directed acyclic graphs (DAGs), and are data-intensive and sensitive to cache misses. We solve the problem of scheduling DAGs of data-intensive tasks on multiple processors or machines in order to minimize execution time. To do so, we propose scheduling algorithms that take cache misses into account. Simulations and an experimental evaluation on a Spark cluster demonstrate the advantages of our solutions in terms of workload completion time. For data-intensive shuffling, we focus on MapReduce-style processing. Communication overhead is incurred in the Shuffle stage, which sends intermediate results from mappers to reducers. We solve the following problem: given a collection of mapper outputs (intermediate key-value pairs) and a partitioning of this collection among the reducers, which node should each reducer run on to minimize data transfer? We reduce two natural formulations of this problem to optimization problems for which polynomial-time solutions exist. We show that our techniques can cut communication costs by 50 percent or more compared to Hadoop’s default reducer placement, which leads to lower network utilization and faster MapReduce job runtimes. [en]
dc.language.iso: en [en]
dc.publisher: University of Waterloo [en]
dc.subject: scheduling [en]
dc.subject: database [en]
dc.subject: data analytics [en]
dc.title: Data-intensive Scheduling [en]
dc.type: Master Thesis [en]
dc.pending: false
uws-etd.degree.department: David R. Cheriton School of Computer Science [en]
uws-etd.degree.discipline: Computer Science [en]
uws-etd.degree.grantor: University of Waterloo [en]
uws-etd.degree: Master of Mathematics [en]
uws.contributor.advisor: Golab, Lukasz
uws.contributor.affiliation1: Faculty of Mathematics [en]
uws.published.city: Waterloo [en]
uws.published.country: Canada [en]
uws.published.province: Ontario [en]
uws.typeOfResource: Text [en]
uws.peerReviewStatus: Unreviewed [en]
uws.scholarLevel: Graduate [en]




UWSpace

University of Waterloo Library
200 University Avenue West
Waterloo, Ontario, Canada N2L 3G1
519 888 4883

All items in UWSpace are protected by copyright, with all rights reserved.
